Chapter 26: Bayesian Networks

Цей контент ще не доступний вашою мовою.

Cast of characters

Name	Lifespan	Role
Judea Pearl	1936–	Computer scientist at UCLA; developed belief-network propagation (1986) and authored Probabilistic Reasoning in Intelligent Systems (1988), establishing probabilistic graphs as a core AI framework.
Steffen L. Lauritzen	—	Statistician; co-author of the 1988 local-computation paper on graphical probabilistic structures and their application to expert systems.
David J. Spiegelhalter	1953–	Statistician; co-author with Lauritzen (1988); helped broaden the graphical-model framework beyond Pearl’s AI context.
Gregory F. Cooper	—	AI researcher whose 1990 complexity result clarified the computational bounds of Bayesian-network inference.
Eugene Charniak	1946–2023	AI researcher; wrote the 1991 AI Magazine tutorial “Bayesian Networks without Tears,” making the framework accessible to a broad audience.

Timeline (1980s–1991)

timeline
    title Bayesian Networks, 1980s–1991
    1970s-early 1980s : Rule-based expert systems grow, using informal certainty factors and quasi-probabilistic schemes (MYCIN, PROSPECTOR, CASNET, INTERNIST) to handle uncertainty
    1986 : Pearl publishes "Fusion, propagation, and structuring in belief networks" — establishes DAG belief networks and local propagation
    1988 : Pearl publishes Probabilistic Reasoning in Intelligent Systems — book-length framework for probabilistic reasoning and belief updating by network propagation
    1988 : Lauritzen and Spiegelhalter publish local computations with probabilities on graphical structures for expert-system reasoning
    1990 : Cooper publishes a complexity result that bounds what exact Bayesian-network inference can promise
    1991 : Charniak publishes "Bayesian Networks without Tears" — accessible tutorial anchoring the framework in the AI community

Plain-words glossary

Directed acyclic graph (DAG) — A network of nodes connected by arrows where no path loops back to its starting point. In a Bayesian network, each arrow means “this node can directly influence that one.”
Conditional probability table (CPT) — A small table attached to each node listing the probability of each state that node can be in, given every combination of its parents’ states. The tables together define the model’s joint distribution.
Belief propagation — An algorithmic family for updating probability estimates across a network when new evidence arrives, using the graph to route messages.
Singly connected network — A Bayesian network in which there is at most one undirected path between any two nodes.
Conditional independence — Two variables are conditionally independent given a third when knowing the third makes the first two irrelevant to each other.
NP-hard inference — A complexity label meaning a problem is at least as hard as the hardest problems in NP.

Chapter 26: Bayesian Networks

Expert systems made knowledge explicit, but they did not make knowledge clean. The world kept arriving in fragments. A patient might have one symptom but not another. A sensor might be noisy. A rule might usually be true but fail in a minority of cases. A diagnosis might become more plausible after one test and less plausible after another. Classical rule systems could represent many logical relationships, but uncertainty put pressure on their machinery.

Bayesian networks gave AI a disciplined alternative. They did not abandon expert knowledge. They gave it a probabilistic structure. A Bayesian network is a directed acyclic graph whose nodes represent variables or propositions. The arrows encode direct dependencies. Conditional probability tables quantify how the state of a node depends on the states of its parents. Evidence can then be entered at one part of the graph and propagated through the structure.

That sounds abstract, but the historical move was concrete. Instead of storing uncertainty as scattered numerical decorations on rules, the network made dependence visible. The graph said which variables were directly connected. The tables said how strong those local relationships were. The inference algorithm used that structure to update beliefs.

Judea Pearl’s belief-network work in the 1980s helped turn this into a central AI framework. His 1986 paper described belief networks as directed acyclic graphs with nodes for propositions or variables, arcs for direct dependencies, and conditional probabilities for dependency strength. It also emphasized a computational point: in singly connected networks, evidence could be propagated locally through the graph. Pearl’s 1988 book gave the framework a larger architecture for probabilistic reasoning in intelligent systems.

The result was not “Bayes’ theorem in software” and it was not a neural network with probabilities attached. It was a different kind of infrastructure: a way to organize uncertainty so knowledge representation and computation shared the same graph.

The bargain was powerful, but not free. Building the graph required judgment. Filling the probability tables required expert elicitation or data. Exact inference could become computationally hard in general networks. Bayesian networks mattered because they made uncertainty structured, not because they made uncertainty disappear.

[!note] Pedagogical Insight: The Graph Is the Model In a Bayesian network, the arrows are not just a drawing. They encode assumptions about direct dependence, conditional independence, and the route by which evidence can change beliefs elsewhere in the model.

Rules Meet Uncertainty

The expert-system era had shown that symbolic knowledge could be operationally useful. A system could store rules, ask questions, consult a knowledge base, and produce advice. But many expert domains were not crisp. Medicine, equipment diagnosis, geology, finance, and planning all depended on partial evidence. The problem was not that experts lacked rules. It was that their rules often came with uncertainty attached.

An expert might say that a finding supports one diagnosis but does not prove it. Another observation might weaken the diagnosis without eliminating it. Two pieces of evidence might be redundant because they share a common cause. A missing observation might be uninformative in one context and suspicious in another. The rule itself was not enough. The system also needed a disciplined way to manage degrees of belief.

Early expert systems often used informal numerical methods. Lauritzen and Spiegelhalter’s 1988 paper explicitly placed exact probabilistic computation against a background where systems such as MYCIN, PROSPECTOR, CASNET, and INTERNIST used certainty factors or other quasi-probabilistic schemes. Those methods were practical attempts to cope with uncertainty, not signs of carelessness. But they left AI with a maintenance problem. If a number attached to one rule changed, what else should change? If two rules shared evidence, how should the overlap be counted? If uncertainty appeared in many places, where was the global structure?

Bayesian networks answered by moving uncertainty from isolated rules into a graph. The graph did not make expert judgment unnecessary. It forced that judgment into explicit local commitments. Which variables matter? Which variables directly depend on which others? What probabilities should be attached to those dependencies? Those questions were still hard, but they were now placed in a common representation.

This was a different style of knowledge engineering. Rule systems often felt like procedural knowledge: if this condition appears, consider that conclusion. A Bayesian network made the domain look more like a structured map of relationships. The map could be inspected. It could be criticized. It could be updated. It could also be wrong in visible ways.

That visibility was part of the point. A hidden numerical score buried inside a rule chain is difficult to reason about. A graph lets a domain expert and a programmer argue over the same object. Should a symptom point directly to a disease, or should both depend on a hidden condition? Should a test result depend on the disease alone, or also on a measurement condition? The graph turns those modeling choices into explicit infrastructure.

The new structure also changed how uncertainty moved. In a purely local rule system, evidence may trigger a chain of conclusions, but the semantics of belief can become unclear. In a Bayesian network, evidence changes probability distributions under the assumptions encoded by the graph. The update may still be expensive, approximate, or dependent on the network’s structure, but it is probabilistically grounded.

This is why Bayesian networks belong beside expert systems rather than after them as a simple replacement story. They did not say expertise was useless. They said expertise should be represented with uncertainty and dependence made explicit.

The Graph as a Memory Aid

A small diagnostic example shows the idea. Suppose a model contains a disease, a symptom, and a test result. The disease may influence the symptom. The disease may also influence the test result. The graph draws arrows from the disease to those observations. The probability table for each observation says how likely that observation is when the disease is present or absent.

If a symptom is observed, belief about the disease changes. If the test result arrives later, belief changes again. If the symptom and the test are both effects of the same disease, the model should not treat them as completely independent pieces of evidence. The graph records the dependency structure needed to avoid that kind of double counting.

This is the central representational advantage. The network separates local probability judgments from global inference. The expert supplies local relationships: how one variable depends on a few others. The inference procedure uses the network to combine those local judgments into updated beliefs across the model.

Charniak’s 1991 tutorial made this accessible by emphasizing directed acyclic graphs, random-variable nodes, neighboring probabilities, conditional probability tables, and evidence updates. A root node receives a prior probability. A nonroot node receives probabilities conditioned on its parents. Together, those local tables define a joint probability distribution under the independence assumptions of the graph.

That last phrase is important. A Bayesian network is compact because it does not require the modeler to list every possible combination of every variable directly. It uses conditional independence assumptions to reduce the burden. Instead of one enormous table over all variables, the model can use many smaller tables. The graph explains why that reduction is allowed.

The reduction is not magic. Conditional probability tables can still grow quickly when a node has many parents. If a variable depends on several multi-state parents, the table may become large and difficult to fill. This is one reason the graph is not merely a computational convenience. It is also a modeling discipline. Choosing too many parents may make the model more faithful in one sense and less maintainable in another.

The graph also creates a shared language between people and machines. A domain expert may not want to discuss a full joint probability distribution. But the same expert can often discuss whether one factor directly affects another. They can look at an arrow and ask whether it belongs. They can look at a node and ask what states it should have. They can look at a probability table and argue whether the numbers are plausible.

That is why the graph works as a memory aid. It records the domain’s dependency assumptions in a form that can be read, challenged, and reused. A rule base can also be inspected, but a Bayesian network gives uncertainty a topology. It lets the model remember not only facts, but how uncertain facts are supposed to hang together.

The name “network” can mislead modern readers because neural networks dominate current AI language. Bayesian networks are not differentiable stacks of weighted layers. Their nodes are random variables or propositions, and their arrows encode probabilistic dependence. The computation is inference over a probability model, not gradient descent over weights. The shared word hides a deeply different machine.

Conditional Independence as Compression

The deepest trick in a Bayesian network is not the arrow. It is what the absence of an arrow is allowed to mean.

If every variable in a domain could depend directly on every other variable, the model would need an enormous joint probability table. Each new variable would multiply the number of possible states. That is one reason uncertainty was so difficult to make operational. Probability theory was coherent, but a fully explicit probability distribution over a rich expert domain could be too large to write down, inspect, or compute with.

Bayesian networks use conditional independence assumptions to compress that distribution. A variable is modeled as depending directly on its parents in the graph. Other relationships may still exist, but they are mediated through the structure. If the assumptions are right, the modeler can specify smaller local tables instead of one massive global table.

Charniak’s tutorial emphasized this reduction as one of the practical reasons Bayesian networks mattered. The model still represents a joint distribution, but the graph factors it into local pieces. A root node needs a prior probability. A child node needs a table conditioned on its parents. The total model is assembled from those local commitments.

That compression is also a source of discipline. It forces the designer to say which dependencies are direct. In a medical model, a symptom might depend on a disease; a test result might depend on both the disease and the quality of the sample; two symptoms might become conditionally independent once the disease is known. Those choices are not merely technical. They are claims about the domain.

This is why the graph can be both a strength and a liability. If the graph captures the right conditional independencies, it turns an impossible table into a maintainable structure. If the graph leaves out a real dependency, the model can become confidently wrong. The compression is earned by assumptions, and assumptions can fail.

That tradeoff made Bayesian networks useful for expert systems. They did not ask experts to provide a complete probability distribution over every possible world state. They asked for local structure and local numbers. That was still hard, but it was a plausible workflow. A domain expert could focus on one relationship at a time: given these parent states, how likely is this child state?

The infrastructure lesson is subtle. Bayesian networks did not reduce uncertainty because uncertainty became simple. They reduced it by creating places to put uncertainty. The graph held structural assumptions. The tables held local numerical assumptions. The inference algorithm used both. That division of labor made probabilistic reasoning something a system could operate on rather than a philosophical ideal.

Pearl’s Propagation Problem

Pearl’s contribution was not just that he drew graphs. The infrastructure moment was that the graph could also direct computation. His 1986 belief network paper described links as both storing knowledge and directing data flow. In singly connected networks, belief updates could be propagated through local message passing, with computation tied to the graph’s paths rather than to a monolithic enumeration of every possibility.

That was a crucial shift. A probability model that cannot be computed is only a formal comfort. To matter in AI, probability had to become executable. Pearl helped show how graph structure could make updates tractable in important cases. Evidence at one node could send messages through the network. Beliefs elsewhere could change because the graph specified how variables depended on one another.

The qualifier “singly connected” must stay visible. A singly connected network is one where, roughly, there is at most one undirected path between any two nodes. That structure prevents certain feedback-like complications in the inference graph. Pearl’s local propagation result is not a universal promise that every Bayesian network can be updated cheaply. It is a result about a class of structures where the graph supports efficient local computation.

Pearl’s 1988 Probabilistic Reasoning in Intelligent Systems expanded the framework. Its chapters on Markov and Bayesian networks and on belief updating by network propagation placed probabilistic graphs at the center of a broader AI program. The book’s importance was architectural. It presented uncertainty not as an afterthought on symbolic reasoning, but as a representational and computational foundation for plausible inference.

This book-level framing matters because the 1980s AI world had competing answers to uncertainty. Some systems used certainty factors. Some used heuristic scores. Some avoided uncertainty by forcing crisp rule conditions. Pearl’s framework insisted that probability theory could be operationalized if the model was structured properly.

The word “operationalized” is doing real work here. It is easy to praise probability as the correct language of uncertainty. It is harder to make probability run inside an AI system with limited memory, limited time, and a human-maintained knowledge base. Pearl’s belief-network program tied the language of probability to a data structure and to propagation procedures. That is what made the framework infrastructural rather than merely philosophical.

That did not mean every system became Bayesian overnight. It meant AI had a clearer language for a hard class of problems. If evidence is incomplete, the system can update beliefs. If variables are conditionally independent given others, the graph can represent that compactly. If dependencies are local, the model does not need to treat every variable as directly connected to every other variable.

The historical importance is therefore partly philosophical and partly practical. Philosophically, Bayesian networks made probability respectable as a language for intelligent reasoning under uncertainty. Practically, they offered data structures and algorithms that could make that language run.

This is also where Bayesian networks connect to the surrounding chapters. Chapter 24 showed the value of turning calculus into an executable learning routine. Chapter 25 showed that hidden-layer networks had representational capacity under mathematical assumptions. Chapter 26 is a different branch of the same infrastructure story: probability became executable when model structure and computation were joined.

Expert Systems With Probability

Lauritzen and Spiegelhalter showed that Pearl was not the only route into graphical probabilistic reasoning. Their 1988 work on local computations with probabilities on graphical structures addressed expert systems directly. They argued against the idea that exact probabilistic methods were automatically too expensive for large sparse networks, and they grounded the discussion in systems such as MUNIN, a medical expert system for electromyography.

This broader context is important because it prevents a lone-genius story. Bayesian networks became influential through a convergence of AI, statistics, expert systems, and graph-based computation. Pearl is central to the chapter, but the larger movement was not simply one person’s theorem. It was a reorganization of uncertainty around graphical structure.

MUNIN is useful as a bounded example. Lauritzen and Spiegelhalter described it in the context of electromyography diagnosis, causal network structure, and conditional probability tables. The example shows both promise and cost. A graphical model can represent a medical domain in a structured way, but every node and every parent combination creates probability commitments that someone must specify, estimate, or revise.

Charniak’s later tutorial adds another bounded example through PATHFINDER, a medical-diagnosis system, and mentions software such as IDEAL and HUGIN. These examples should not be inflated into a claim that Bayesian networks took over expert systems. They show that the framework had enough practical traction to support tools, tutorials, and applications. The chapter’s safe claim is smaller and stronger: Bayesian networks gave uncertainty a reusable modeling architecture.

This architecture helped knowledge engineering by making assumptions local. If a node’s parents are fixed, the expert can reason about that node’s conditional probabilities in context. The modeler does not have to specify the entire world at once. The network decomposes the task.

But decomposition creates new responsibilities. A graph is an argument. It says some dependencies are direct and others are mediated. If that argument is wrong, the inference can be wrong. A missing arrow can hide a real dependency. An extra arrow can make the model more complex than the evidence supports. A bad probability table can make a good graph behave badly.

Bayesian networks therefore did not end knowledge engineering. They made a different kind of knowledge engineering possible. The work moved from writing only rules to designing variables, dependencies, and probability tables. This was not necessarily easier. It was more coherent.

That coherence mattered. A probability attached to a rule may be hard to interpret across a large system. A conditional probability inside a network has a defined place in the joint model. The whole system may still be approximate or incomplete, but its uncertainty has semantics.

This semantic clarity had another practical effect: it made disagreement more productive. If a rule-based system gave surprising advice, the problem might be hidden in an interaction among rules, certainty factors, and control flow. If a Bayesian network gave surprising advice, the analyst could inspect the graph, the evidence, and the relevant probability tables. The answer might still be difficult to fix, but the debugging surface was clearer. The model had an anatomy.

That anatomy mattered for maintenance. Expert systems were not written once and forgotten. Domains changed. Experts disagreed. New tests appeared. Old assumptions failed. A graphical probabilistic model gave maintainers a way to localize some of that change. A dependency could be added or removed. A table could be revised. The cost remained high, but it was no longer scattered through a long chain of loosely interpreted uncertainty numbers.

The Cost Returns

Every infrastructure breakthrough has a bill. For Bayesian networks, the bill was inference.

Pearl’s local propagation results made singly connected structures attractive, but many real domains are not simple trees. Variables can interact through multiple paths. Evidence can create dependencies between causes that were otherwise separate. Diagnostic domains often want many diseases, symptoms, tests, and contextual variables. The graph that best represents the domain may not be the graph that is easiest to compute.

Gregory Cooper’s 1990 paper made the limit explicit. Probabilistic inference in Bayesian belief networks is NP-hard in general. Cooper’s result did not make Bayesian networks useless. It made the optimism honest. Singly connected cases and restricted topologies can be efficient. General multiply connected networks can be computationally difficult. The research program therefore had to include special cases, average-case behavior, approximations, and better algorithms.

This is the same pattern seen elsewhere in AI history. A representation can be elegant and still expensive. A theorem can justify a method and still leave an engineering problem. A graph can make uncertainty structured and still require hard computation. The breakthrough is not that cost vanishes. The breakthrough is that the cost becomes visible enough to manage.

Charniak’s tutorial made the same caveat accessible: Bayesian networks reduce the burden of specifying probabilities compared with an exponential full joint table, but evaluation time remains a drawback. That is the right balance. The network buys compact representation by exploiting structure. It does not guarantee cheap inference for every structure.

This limit also protects the chapter from overclaiming. Bayesian networks did not solve uncertainty. They organized it. They made a class of probabilistic reasoning problems easier to state, inspect, and sometimes compute. They also created a long algorithmic agenda: exact inference where possible, approximate inference where necessary, and learning methods when hand-built probabilities were too costly.

The hand-built part is another cost. Conditional probability tables can be elicited from experts, estimated from data, or some combination of both. In the 1980s story, the key infrastructure was often knowledge-engineered rather than learned from massive datasets. Later work would ask how to learn parameters and structure. That later trajectory matters, but it should not be read back too strongly into the original event. The early story is representation plus inference, not big-data automation.

This is why Bayesian networks are best understood as a bridge. They joined symbolic AI’s concern with explicit structure to probability theory’s language of uncertainty and statistics’ concern with inference. They were neither old expert systems with nicer numbers nor modern deep learning in disguise. They were their own infrastructure layer.

The Bridge to Causality and Learning

Bayesian networks later became closely associated with causal reasoning, especially through Pearl’s later work. That association is real, but it should not dominate this 1980s chapter. A Bayesian network can be read as a probabilistic graphical model. A causal network adds stronger claims about intervention and mechanism. The arrows may look similar, but the interpretation is not automatically the same.

The safe historical handoff is this: the 1980s Bayesian-network framework made directed probabilistic structure central to AI reasoning under uncertainty. That structure later supported causal modeling, learning algorithms, and probabilistic graphical models more broadly. The later fields inherited a language of nodes, arrows, conditional dependencies, and inference, but they also added new assumptions and methods.

This bridge matters because it shows an alternative path through AI history. While neural networks were being rehabilitated through backpropagation and approximation theory, probabilistic AI was building a different answer to intelligence: reason carefully when the world is uncertain. The two paths would later interact, compete, and combine. But in the late 1980s, Bayesian networks gave uncertainty its own architecture.

The chapter should end on that point. Bayesian networks did not make AI omniscient. They did not make expert knowledge easy to collect. They did not make inference free. What they did was more durable: they gave uncertain knowledge a graph, gave the graph probabilities, and gave the probabilities a route for computation.

That was enough to change the field’s expectations. A serious AI system no longer had to choose between crisp symbolic rules and unstructured numerical hunches. It could represent uncertain relationships explicitly and compute with them.

Bayesian networks made uncertainty into infrastructure. The next chapter turns back toward neural networks, where the problem was not probabilistic belief but how to constrain a powerful model class into an architecture that could exploit the structure of images.

Sources

Primary

Judea Pearl, “Fusion, propagation, and structuring in belief networks”, Artificial Intelligence 29(3), 241-288 (1986): abstract-level anchor for belief networks as directed acyclic graphs with variable/proposition nodes, dependency arcs, conditional probabilities, and local propagation in singly connected networks. Full internal article passages remain access-limited.
Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann (1988): book-level anchor for Markov/Bayesian networks and belief updating by network propagation; used for chapter structure and publisher-description support rather than unsourced internal quotations.
Steffen L. Lauritzen and David J. Spiegelhalter, “Local computations with probabilities on graphical structures and their application to expert systems”, Journal of the Royal Statistical Society: Series B 50(2), 157-224 (1988): expert-system uncertainty contrast, local probabilistic computation, MUNIN context, and conditional probability table infrastructure.
Gregory F. Cooper, “The computational complexity of probabilistic inference using Bayesian belief networks,” Artificial Intelligence 42(2-3), 393-405 (1990): limit-setting source for NP-hardness of general probabilistic inference, singly connected restricted cases, and the need for special-case and approximate algorithms.

Secondary

Eugene Charniak, “Bayesian Networks without Tears”, AI Magazine 12(4), 50-63 (1991): accessible tutorial anchor for DAGs, random-variable nodes, conditional probability tables, evidence updates, reduced probability specification, PATHFINDER, IDEAL/HUGIN, and evaluation time caveats.

[!note] Honesty Over Output This chapter uses Pearl 1986 and Pearl 1988 cautiously where access is abstract-level or chapter-level. Detailed claims about expert-system uncertainty, tutorial definitions, and computational limits are carried by Lauritzen/Spiegelhalter, Charniak, and Cooper, where page anchors are available in the research contract.