2012 BayesianReasoningandMachineLear

Jump to navigation Jump to search

Subject Headings: Junction Tree Algorithm, Learning as Inference, Naive Bayes, Learning with hidden variables, Bayesian model selection, Nearest neighbour classification, Unsupervised linear dimension reduction, Supervised linear dimension reduction, Linear models, Bayesian linear models, Gaussian process, Mixture models, Latent linear models, Latent ability model, Discrete-state Markov models, Continuous-state Markov models, Switching linear dynamical systems, Deterministic approximate inference.


Cited By



Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly. People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques, they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online.


The data explosion

We live in a world that is rich in data, ever increasing in scale. This data comes from many different sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer databases, financial transactions, engine monitoring, speech recognition, surveillance, search). Possessing the knowledge as to how to process and extract value from such data is therefore a key and increasingly important skill. Our society also expects ultimately to be able to engage with computers in a natural manner so that computers can `talk' to humans, `understand' what they say and `comprehend' the visual world around them. These are difficult large-scale information processing tasks and represent grand challenges for computer science and related fields. Similarly, there is a desire to control increasingly complex systems, possibly containing many interacting parts, such as in robotics and autonomous navigation. Successfully mastering such systems requires an understanding of the processes underlying their behaviour. Processing and making sense of such large amounts of data from complex systems is therefore a pressing modern day concern and will likely remain so for the foreseeable future.

Machine Learning

Machine Learning is the study of data-driven methods capable of mimicking, understanding and aiding human and biological information processing tasks. In this pursuit, many related issues arise such as how to compress data, interpret and process it. Often these methods are not necessarily directed to mimicking directly human processing but rather to enhance it, such as in predicting the stock market or retrieving information rapidly. In this probability theory is key since inevitably our limited data and understanding of the problem forces us to address uncertainty. In the broadest sense, Machine Learning and related fields aim to `learn something useful' about the environment within which the agent operates. Machine Learning is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data to drive and adapt the model.

In the early stages of Machine Learning and related areas, similar techniques were discovered in relatively isolated research communities. This book presents a unified treatment via graphical models, a marriage between graph and probability theory, facilitating the transference of Machine Learning concepts between different branches of the mathematical and computational sciences.

Part I. Inference in Probabilistic Models:

1. Probabilistic reasoning

Probabilistic models explicitly take into account uncertainty and deal with our imperfect knowledge of the world. Such models are of fundamental significance in Machine Learning since our understanding of the world will always be limited by our observations and understanding. We will focus initially on using probabilistic models as a kind of expert system. …

2. Basic graph concepts

3. Belief networks

4. Graphical models

5. Efficient inference in trees

6. The junction tree algorithm

7. Making decisions

Part II. Learning in Probabilistic Models:

8. Statistics for machine learning

9. Learning as inference

9.6.5 Conditional random fields

For an input x and output y, a CRF is defined by a conditional distribution [282, 180]

p(yjx) = 1 Z(x) Y k f k(y; x) (9.6.53)

for (positive) potentials �k(y; x). To make learning more straightforward, the potentials are usually defined as exp (�kfk(y; x)) for fixed functions f(y; x) and parameters �k. In this case the distribution of the output conditioned on the input is

p(y | x) = 1 / Z(x; �) Y k exp (\lambda_k f_k(y,x)) (9.6.54)

CRFs can also be viewed simply as Markov networks with exponential form potentials, as in section(9.6.4). Equation(9.6.54) is equivalent to equation (9.6.39) where the parameters [math]\displaystyle{ \theta }[/math] are here denoted by [math]\displaystyle{ \lambda }[/math] and the variables x are here denoted by y. In the CRF case the inputs x simply have the effect of determining the feature fk(y; x).

For an i.i.d. dataset of input-outputs, D = f(xn; yn); n = 1; : : : ;Ng, training based on conditional maximum likelihood requires the maximisation of

L(\lambda) XN n=1 log p(ynjxn; ) = XN n=1 X k �kfk(yn; xn) 􀀀 log Z(xn; �) (9.6.55)

In general no closed form solution for the optimal [math]\displaystyle{ \lambda }[/math] exists and this needs to be determined numerically. The methods we've discussed, such as iterative scaling may be readily adapted to this optimisation problem, although in practice gradient based techniques are to be preferred[211]. For completeness we describe gradient based training. The gradient has components

@ @�i L = X n � fi(yn; xn) 􀀀 hfi(y; xn)ip(yjxn;�) �(9.6.56)

The terms hfi(y; xn)ip(yjxn;) can be problematic and their tractability depends on the structure of the potentials. For a multivariate y, provided the structure of the cliques defined on subsets of y is singly-connected, then computing the average is generally tractable. More generally, provided the cliques of the resulting junction tree have limited width, then exact marginals are available. An example of this is given for a linear-chain CRF in section(23.4.3) { see also example(9.11) below.

Another quantity often useful for numerical optimisation is the Hessian which has components

@2 @�i@�j L = X n (hfi(y; xn)i hfj(y; xn)i 􀀀 hfi(y; xn)fj(y; xn)i) (9.6.57)

where the averages above are with respect to p(yjxn; �). This expression is a (negated) sum of covariance elements, and is therefore negative (semi) definite. Hence the function [math]\displaystyle{ L(\lambda) }[/math] is concave and has only a single global optimum. In practice CRFs often have many thousands if not millions of parameters [math]\displaystyle{ \lambda }[/math] so that computing the Newton update associated with the inverse of the Hessian may be prohibitively expensive. In this case alternative optimisation methods such as conjugate gradients are preferable.

In practice regularisation terms are often added to prevent overfitting (see section(13.2.2) for a discussion of regularisation) . Using a term

􀀀 X k c2 k�2 k (9.6.58)

for positive regularisation constants c2 k discourages the weights � from being too large. This term is also negative definite and hence the overall objective function remains concave.

Once trained a CRF can be used for predicting the output distribution for a novel input x�. The most likely output y� is equivalently given by y� = argmax y log p(yjx�) = argmax y X k �kfk(y; x�) 􀀀 log Z(x�; �) (9.6.59) Since the normalisation term is independent of y, finding the most likely output is equivalent to y� = argmax y X k �kfk(y; x�) (9.6.60)

Natural language processing

In a natural language processing application, xt might represent a word and yt a corresponding linguistic tag (`noun',`verb', etc.). A more suitable form in this case is to constrain the CRF to be of the form

exp X k �kgk(yt; yt􀀀1) + X l �lhl(yt; xt) ! (9.6.61)

for binary functions gk and hl and parameters �k and fil. The grammatical structure of tag-tag transitions is encoded in gk(yt; yt􀀀1) and linguistic tag information in hk(yt; xt), with the importance of these being determined by the corresponding parameters [180]. In this case inference of the marginals hytyt􀀀1jx1:T i is straightforward since the factor graph corresponding to the inference problem is a linear chain.

Variants of the linear chain CRF are used heavily in natural language processing, including part-of-speech tagging and machine translation (in which the input sequence x represents a sentence say in English and the output sequence y the corresponding translation into French). See, for example, [229].

10. Naive Bayes

11. Learning with hidden variables

12. Bayesian model selection

Part III. Machine Learning:

13. Machine learning concepts

14. Nearest neighbour classification

15. Unsupervised linear dimension reduction

16. Supervised linear dimension reduction

17. Linear models

18. Bayesian linear models

19. Gaussian processes

20. Mixture models

21. Latent linear models

22. Latent ability models

Part IV. Dynamical Models:

23. Discrete-state Markov models

24. Continuous-state Markov models

25. Switching linear dynamical systems

26. Distributed computation

Part V. Approximate Inference:

27. Sampling

28. Deterministic approximate inference



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2012 BayesianReasoningandMachineLearDavid BarberBayesian Reasoning and Machine Learning2012