- (Murphy, 1998) ⇒ Kevin P. Murphy. (1998). “A Brief Introduction to Graphical Models and Bayesian Networks." Web tutorial.
Subject Headings: Bayesian Networks.
- Representation, or, what exactly is a graphical model?
- Inference, or, how can we use these models to efficiently answer probabilistic queries?
- Learning, or, what do we do if we don't know what the model is?
- Decision theory, or, what happens when it is time to convert beliefs into actions?
- Applications, or, what's this all good for, anyway?
Probabilistic graphical models are graphs in which nodes represent random variables, and the (lack of) arcs represent conditional independence assumptions. Hence they provide a compact representation of joint probability distributions. Undirected graphical models, also called Markov Random Fields (MRFs) or Markov networks, have a simple definition of independence: two (sets of) nodes A and B are conditionally independent given a third set, C, if all paths between the nodes in A and B are separated by a node in C. By contrast, directed graphical models also called Bayesian Networks or Belief Networks (BNs), have a more complicated notion of independence, which takes into account the directionality of the arcs, as we explain below.
Dynamic Bayesian Networks (DBNs) are directed graphical models of stochastic processes. They generalise hidden Markov models (HMMs) and linear dynamical systems (LDSs) by representing the hidden (and observed) state in terms of state variables, which can have complex interdependencies. The graphical structure provides an easy way to specify these conditional independencies, and hence to provide a compact parameterization of the model.
Note that “temporal Bayesian network” would be a better name than “dynamic Bayesian network", since it is assumed that the model structure does not change, but the term DBN has become entrenched. We also normally assume that the parameters do not change, i.e., the model is time-invariant. However, we can always add extra hidden nodes to represent the current "regime", thereby creating mixtures of models to capture periodic non-stationarities. There are some cases where the size of the state space can change over time, e.g., tracking a variable, but unknown, number of objects. In this case, we need to change the model structure over time.
- Hidden Markov Models (HMMs)
The simplest kind of DBN is a Hidden Markov Model (HMM), which has one discrete hidden node and one discrete or continuous observed node per slice. We illustrate this below. As before, circles denote continuous nodes, squares denote discrete nodes, clear means hidden, shaded means observed.
We have "unrolled" the model for 4 "time slices" -- the structure and parameters are assumed to repeat as the model is unrolled further. Hence to specify a DBN, we need to define the intra-slice topology (within a slice), the inter-slice topology (between two slices), as well as the parameters for the first two slices. (Such a two-slice temporal Bayes net is often called a 2TBN.)
Some common variants on HMMs are shown below:
HMM with Gaussian output HMM with mixture of Gaussians output Auto Regressive HMM Input-output HMM Coupled HMM Factorial HMM
A graphical model specifies a complete joint probability distribution (JPD) over all the variables. Given the JPD, we can answer all possible inference queries by marginalization (summing out over irrelevant variables), as illustrated in the introduction. However, the JPD has size [math]O(2^n)[/math], where n is the number of nodes, and we have assumed each node can have 2 states. Hence summing over the JPD takes exponential time. We now discuss more efficient methods.
One needs to specify two things to describe a BN: the graph topology (structure) and the parameters of each CPD. It is possible to learn both of these from data. However, learning structure is much harder than learning parameters. Also, learning when some of the nodes are hidden, or we have missing data, is much harder than when everything is observed. This gives rise to 4 cases:
Structure Observability Method --------------------------------------------- Known Full Maximum Likelihood Estimation Known Partial EM (or gradient ascent) Unknown Full Search through model space Unknown Partial EM + search through model space
Known structure, full observability
Known structure, partial observability
Unknown structure, full observability
Unknown structure, partial observability
It is often said that "Decision Theory = Probability Theory + Utility Theory". We have outlined above how we can model joint probability distributions in a compact way by using sparse graphs to reflect conditional independence relationships. It is also possible to decompose multi-attribute utility functions in a similar way: we create a node for each term in the sum, which has as parents all the attributes (random variables) on which it depends; typically, the utility node(s) will have action node(s) as parents, since the utility depends both on the state of the world and the action we perform. The resulting graph is called an influence diagram. In principle, we can then use the influence diagram to compute the optimal (sequence of) action(s) to perform so as to maximimize expected utility, although this is computationally intractible for all but the smallest problems.
Classical control theory is mostly concerned with the special case where the graphical model is a Linear Dynamical System and the utility function is negative quadratic loss, e.g., consider a missile tracking an airplane: its goal is to minimize the squared distance between itself and the target. When the utility function and/or the system model becomes more complicated, traditional methods break down, and one has to use reinforcement learning to find the optimal policy (mapping from states to actions).
The most widely used Bayes Nets are undoubtedly the ones embedded in Microsoft's products, including the Answer Wizard of Office 95, the Office Assistant (the bouncy paperclip guy) of Office 97, and over 30 Technical Support Troubleshooters.
BNs originally arose out of an attempt to add probabilities to expert systems, and this is still the most common use for BNs. A famous example is QMR-DT, a decision-theoretic reformulation of the Quick Medical Reference (QMR) model.
* Daphne Koller and Nir Friedman, "Probabilistic graphical models: principles and techniques", MIT Press 2009
* Adnan Darwiche, "Modeling and reasoning with Bayesian networks", Cambridge 2009
* F. V. Jensen. "Bayesian Networks and Decision Graphs". Springer. 2001. Probably the best introductory book available.
* D. Edwards. "Introduction to Graphical Modelling", 2nd ed. Springer-Verlag. 2000. Good treatment of undirected graphical models from a statistical perspective.
* J. Pearl. "Causality". Cambridge. 2000. The definitive book on using causal DAG modeling.
* R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter. "Probabilistic Networks and Expert Systems". Springer-Verlag. 1999. Probably the best book available, although the treatment is restricted to exact inference.
* M. I. Jordan (ed). “Learning in Graphical Models". MIT Press. 1998. Loose collection of papers on machine learning, many related to graphical models. One of the few books to discuss approximate inference.
* B. Frey. "Graphical models for machine learning and digital communication", MIT Press. 1998. Discusses pattern recognition and turbocodes using (directed) graphical models.
* E. Castillo and J. M. Gutierrez and A. S. Hadi. "Expert systems and probabilistic network models". Springer-Verlag, 1997. A Spanish version is available online for free.
* F. Jensen. "An introduction to Bayesian Networks". UCL Press. 1996. Out of print. Superceded by his 2001 book.
* S. Lauritzen. "Graphical Models", Oxford. 1996. The definitive mathematical exposition of the theory of graphical models.
* S. Russell and P. Norvig. "Artificial Intelligence: A Modern Approach". Prentice Hall. 1995. Popular undergraduate textbook that includes a readable chapter on directed graphical models.
* J. Whittaker. "Graphical Models in Applied Multivariate Statistics", Wiley. 1990. This is the first book published on graphical modelling from a statistics perspective.
* R. Neapoliton. "Probabilistic Reasoning in Expert Systems". John Wiley & Sons. 1990.
* J. Pearl. "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference." Morgan Kaufmann. 1988. The book that got it all started! A very insightful book, still relevant today.