Bayesian Models of Brain and Behaviour

This paper presents a review of Bayesian models of brain and behaviour. We first review the basic principles of Bayesian inference. This is followed by descriptions of sampling and variational methods for approximate inference, and forward and backward recursions in time for inference in dynamical models. The review of behavioural models covers work in visual processing, sensory integration, sensorimotor integration, and collective decision making. The review of brain models covers a range of spatial scales from synapses to neurons and population codes, but with an emphasis on models of cortical hierarchies. We describe a simple hierarchical model which provides a mathematical framework relating constructs in Bayesian inference to those in neural computation. We close by reviewing recent theoretical developments in Bayesian inference for planning and control.


Introduction
This paper presents a review of Bayesian models of brain and behaviour. Overall, the aim of the paper is to review work which relates constructs in Bayesian inference to aspects of behaviour and neural computation, as outlined in Figure 1. This is a very large research area and we refer readers to standard textbooks and other review materials [1][2][3][4][5][6].
One of the main ideas to emerge in recent years is that Bayesian inference operates at the level of cortical macrocircuits. These circuits are arranged in a hierarchy which reflects the hierarchical structure of the world around us. The idea that the brain encodes a model of the world and makes predictions about its sensory input is also known as predictive coding [7].
Consider, for example, your immediate environment. It may be populated by various objects such as desks, chairs, walls, trees, and so forth. Generic attributes of this scene and the objects in it will be represented by activity in brain regions near the top of the hierarchy. The connections from higher to lower regions then encode a model of your world, describing how scenes consist of objects, and objects of their features. If a higher level representation is activated, it will activate those lower level representations that encode the presence of, for example, configurations of oriented lines that your brain expects to receive signals about in early visual cortex.
At the lowest level of the hierarchy these predictions are compared with sensory input and the difference between them, the prediction error, is propagated back up the hierarchy. This happens simultaneously at every hierarchical level. Predictions are sent down and prediction errors back up. It is important to emphasize that this is a dynamic process. Upon entering a new environment, such as a room in a house, higher level schemas will activate the likely presence of objects or people that one expects to encounter in that room. Initially, lower-level prediction errors are likely to be large. These will change activations in higher level regions, as you find that your keys were not on the kitchen table after all. Neuronal populations that initially encoded the likely presence of a key become less active.
The overall process is expressed clearly by Fletcher and Frith [8]: "...these systems are arranged in a hierarchy so that the prediction error emitted by a lower-level system becomes the input for a higher-level system. At the same time, feedback from the higher level system provides the prior beliefs for the lower level system. In this framework, the prediction error signal is a marker that the existing model or inference has not fully accounted for the input. A readjustment at the next level in the hierarchy may increase the accuracy and reduce the prediction error. But if it does not, higher-level readjustments are required. Higher levels provide guidance to lower levels and ensure an internal consistency of the inferred causes of sensory input at multiple levels."
Predictive coding models comprising multiple hierarchical levels are rather complex, however, when compared to much of the work in Bayesian modelling of brain and behaviour. We therefore structure our review to focus first on models of simple behaviours and on Bayesian models of simple computations in synapses, neurons, and neural populations, before leading up to a more in-depth review of Bayesian inference in cortical macrocircuits in Section 5.
Section 2 reviews concepts in Bayesian inference. This includes the basic principle underlying Bayes rule. For realistic models exact Bayesian inference is impossible, so we briefly describe two of the leading frameworks for approximate inference: sampling and variational methods. We also describe the temporal forward and backward recursions for inference in dynamical models.
Section 3 reviews behavioural models. This covers work in visual processing, sensory integration, sensorimotor integration, and collective decision making. Section 3.2 also describes how visual perceptions can depend on later sensory events, so-called postdiction [9]. It may therefore be the case that perceptions are based on both forwards and backwards inference in time.
The review of brain models in Section 4 covers a range of spatial scales from synapses to neurons and population codes. Section 5 describes models of cortical hierarchies. This is based on early work by Mumford [10], Rao and Ballard [7] and a more recent series of papers by Friston [1,11,12]. We describe a simple hierarchical model which provides a mathematical framework relating quantities in Bayesian inference to those in neural computation. Finally, we very briefly review recent theoretical developments in Bayesian inference for planning and control in Section 6 and close with a discussion in Section 7.
The main sections of the paper can be read in any order, so expert readers can skip to relevant sections. It is perhaps not necessary to fully understand the mathematical parts of Section 2, but they are included to provide a mathematical backbone to which the later discussion of models refers.

Bayesian Inference
It has been proposed that aspects of human behaviour are governed by statistical optimality principles, and that the brain itself is a statistical inference machine [4]. In statistics the optimal way of updating your beliefs is via Bayes rule.
Consider some quantity, x. Our beliefs about the likely values of x can be described by the probability distribution p(x). If we then make a new observation y that is related to x, then we can update our belief about x using Bayesian inference.
First we need to specify the likelihood of observing y given x. This is specified by a probability distribution called the likelihood, p(y | x). It tells us, if we know x, what are the likely values of y. Our updated belief about x, that is, after observing the new data point y, is given by the posterior distribution p(x | y). This can be computed via Bayes rule

$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}. \tag{1}$$

The denominator ensures that p(x | y) sums to 1 over all possible values of x, that is, it is a probability distribution. It can be written as

$$p(y) = \int p(y \mid x')\, p(x')\, dx'. \tag{2}$$

Equations (1) and (2) describe the basic computations that underlie Bayes rule. These are multiplication, normalisation (1), and marginalisation (2).
Wolpert and Ghahramani [13] use the game of tennis to illustrate key points. Imagine that you are receiving serve. One computation you need to make before returning serve is to estimate, x, the position of the ball when it first hits the ground. This scenario is depicted in Figure 2. It is possible to make an estimate solely on the basis of the ball's trajectory, that is, via the data y. We can find the value of x which maximises the likelihood, p(y | x). This is known as Maximum Likelihood (ML) estimation. It is also possible to estimate the uncertainty in this estimate. The ML estimate and the uncertainty in it together give rise to the likelihood distribution shown in Figure 2.
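The three computations named above (multiplication, marginalisation, and normalisation) can be sketched for a discrete hidden variable, where the integral over x becomes a sum. This is a minimal illustration only: the function name `bayes_update` and the three "landing zone" probabilities are hypothetical and are not taken from Wolpert and Ghahramani [13].

```python
def bayes_update(prior, likelihood):
    """Return the posterior over a discrete variable x and the evidence p(y)."""
    # Multiplication: unnormalised posterior p(y | x) p(x) for each value of x.
    joint = [lik * pri for lik, pri in zip(likelihood, prior)]
    # Marginalisation: p(y) is the sum of the joint over all values of x.
    evidence = sum(joint)
    # Normalisation: divide by p(y) so that the posterior sums to 1.
    posterior = [j / evidence for j in joint]
    return posterior, evidence


# Hypothetical numbers: three candidate landing zones for the ball.
prior = [0.5, 0.3, 0.2]        # p(x)
likelihood = [0.1, 0.6, 0.3]   # p(y | x) for the observed trajectory y
posterior, evidence = bayes_update(prior, likelihood)

# Index of the zone with the highest posterior probability.
map_zone = max(range(len(posterior)), key=posterior.__getitem__)
```

The same function can be applied repeatedly, feeding each posterior back in as the next prior, whenever successive observations are conditionally independent given x.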
But before our opponent hits the ball we may have a fair idea as to where they will serve. It may be the case, for example, that when they serve from the right the ball tends to go down the line. We can summarise this belief by the prior distribution p(x) (shown in blue in Figure 2). We can then use Bayes rule to estimate the posterior distribution. This is the optimal combination of prior knowledge ("down the line") and new data (visual information from the ball's trajectory). Our final single best estimate of where the ball will land is then given by the maximum of the posterior density. This is known as MAP estimation (from "maximum a posteriori").
As we continue to see the ball coming toward us we can refine our belief as to where we think the ball will land. This can be implemented by applying Bayes rule recursively such that our belief at time point n depends only on our belief at the previous time point, n − 1. That is,

$$p(x_n \mid Y_n) = \frac{p(y_n \mid x_n)\, p(x_n \mid Y_{n-1})}{p(y_n \mid Y_{n-1})}, \tag{3}$$

where $Y_n = \{y_1, y_2, \ldots, y_n\}$ denotes all observations up to time n. Our prior belief, that is, prior to observing data point $y_n$, is simply the posterior belief after observing all data points up to time n − 1, $p(x_n \mid Y_{n-1})$. Colloquially, we say that "today's prior is yesterday's posterior". The variable x is also referred to as a hidden variable because it is not directly observed.
If the hidden state is a discrete variable, such as whether the ball landed in or out of the service box, one can form a likelihood ratio

$$\mathrm{LR} = \frac{p(y \mid x = 1)}{p(y \mid x = 0)}, \tag{4}$$

where x = 1 and x = 0 label the two possible states. Decisions based on the likelihood ratio are statistically optimal in the sense of having maximum sensitivity for any given level of specificity. In contexts where LR is recursively updated these decisions correspond to a sequential likelihood ratio test [14]. There is a good deal of evidence showing that the firing rates of single neurons in the brain report evolving log LR values [15] (see section on "Neurons" below).
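A recursively updated likelihood ratio can be sketched as a running sum of log likelihood ratios that stops once a decision bound is crossed. The sketch below rests on several assumptions that are not in the text: observations are conditionally independent and Gaussian under each hypothesis, the two decision bounds are symmetric, and all names and numbers are hypothetical.

```python
import math


def gaussian_logpdf(y, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - 0.5 * ((y - mean) / sd) ** 2


def sequential_lr_test(observations, mean_in, mean_out, sd, threshold):
    """Accumulate log LR over observations; stop when a bound is crossed."""
    log_lr = 0.0
    for n, y in enumerate(observations, start=1):
        # Each new observation adds its own log likelihood ratio.
        log_lr += gaussian_logpdf(y, mean_in, sd) - gaussian_logpdf(y, mean_out, sd)
        if log_lr >= threshold:
            return "in", n, log_lr
        if log_lr <= -threshold:
            return "out", n, log_lr
    return "undecided", len(observations), log_lr


# Hypothetical evidence stream favouring the "in" hypothesis.
decision, n_used, log_lr = sequential_lr_test(
    [0.9, 1.1, 1.0, 0.8], mean_in=1.0, mean_out=0.0, sd=1.0, threshold=1.4
)
```

Raising the threshold trades speed for accuracy: more observations are needed before either bound is reached.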

Figure 2: The prior is shown in blue, the likelihood distribution in red, and the posterior distribution with the white ellipse. The maximum posterior estimate is shown by the magenta ball. This estimate can be updated in light of new information about the ball's trajectory (yellow). Adapted from Wolpert and Ghahramani [13].

2.1. Gaussians.
If our random variables x and y are normally distributed then Bayesian inference can be implemented exactly using simple formulae. These are most easily expressed in terms of precisions, where the precision of a random variable is its inverse variance. A precision of 10 corresponds to a variance of 0.1. We first look at inference for a single univariate measure (e.g., distance from side of tennis court). For a Gaussian prior with mean $m_0$ and precision $\lambda_0$, and a Gaussian likelihood with mean $m_D$ and precision $\lambda_D$, the posterior distribution is Gaussian with mean m and precision λ

$$\lambda = \lambda_0 + \lambda_D, \qquad m = \frac{\lambda_0}{\lambda} m_0 + \frac{\lambda_D}{\lambda} m_D. \tag{5}$$

So, precisions add and the posterior mean is the sum of the prior and data means, each weighted by its relative precision. This relationship is illustrated in Figure 3. Though fairly simple, (5) shows how to optimally combine two sources of information. As we shall see in Section 3, various aspects of human behaviour from cue integration to instances of collective decision making have been shown to conform to this "normative model". Similar formulae exist for multivariate (instead of univariate) Gaussians [16] where we have multidimensional hidden states and observations, for example, three-dimensional position of the ball and two-dimensional landing position on court surface.
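The precision-weighted combination described above takes only a few lines of code. The following is a minimal sketch; the function name and the numerical values are hypothetical.

```python
def gaussian_posterior(m0, lam0, mD, lamD):
    """Combine a Gaussian prior (m0, lam0) with a Gaussian likelihood (mD, lamD)."""
    lam = lam0 + lamD                      # precisions add
    m = (lam0 * m0 + lamD * mD) / lam      # precision-weighted mean
    return m, lam


# Hypothetical values: a vague prior at 0 and more precise data at 3.
m, lam = gaussian_posterior(m0=0.0, lam0=1.0, mD=3.0, lamD=2.0)
```

With these values the posterior mean lies two thirds of the way from the prior mean to the data mean, because the data carry two thirds of the total precision.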

Generative Models.
So far we have discussed the relationship between a single hidden variable x and a single observed variable y. More generally, we may have multiple hidden variables, for example, representing different levels of abstraction in cortical hierarchies, and multiple observed variables from different sensory modalities. These more complicated probabilistic relationships can be represented using probabilistic generative models and their associated graphical models [16,17]. If these models do not have cycles they are referred to as Directed Acyclic Graphs (DAGs). A DAG specifies the joint probability of all variables, $x = [x_1, x_2, \ldots, x_H]$. This can be written down as

$$p(x) = \prod_{i=1}^{H} p(x_i \mid \mathrm{pa}[x_i]), \tag{6}$$

where $\mathrm{pa}[x_i]$ are the parents of $x_i$. The generative model in Figure 4 provides an example. All other probabilities can be obtained from the joint probability via marginalisation. They are therefore referred to as marginal probabilities. If one of the variables is known, for example, $x_1$ may be a sensory input, then the marginalisation operation will produce a posterior density. In hierarchical models of cortical macrocircuits, for example, $x_4$ may correspond to activity in a higher level brain region (see Section 5). The posterior density $p(x_4 \mid x_1)$ then tells us how to estimate $x_4$ given sensory input $x_1$.

Figure 3: The posterior is closer to the likelihood than the prior because the likelihood has higher precision. Bayes rule for Gaussians has been used to explain many behaviours from sensory integration to collective decision making.
If multiple marginal or posterior probabilities need to be computed this is most efficiently implemented using the belief propagation algorithm [18], which effectively defines an ordering on the DAG and passes the results of marginalisations between nodes. As we shall see in Section 4, a number of researchers have proposed how belief propagation can be implemented in neural circuits [19,20].
A central quantity in Bayesian modelling is the negative log of the joint density, which is often referred to as the energy

$$E(x) = -\log p(x). \tag{10}$$
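To make the link between a DAG factorisation, marginalisation, and a posterior concrete, the sketch below uses a hypothetical three-node chain x3 -> x2 -> x1 with binary variables. This is not the model of Figure 4, and the conditional probability tables are invented for illustration. The posterior is obtained by brute-force summation over the joint rather than by belief propagation, which would reuse intermediate sums and scale better on larger graphs.

```python
# Hypothetical chain x3 -> x2 -> x1 with binary variables.
p_x3 = {0: 0.7, 1: 0.3}                                      # p(x3)
p_x2_given_x3 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # [x3][x2]
p_x1_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # [x2][x1]


def joint(x1, x2, x3):
    """Joint probability as a product of each node given its parents."""
    return p_x1_given_x2[x2][x1] * p_x2_given_x3[x3][x2] * p_x3[x3]


def posterior_x3_given_x1(x1):
    """Posterior over the top node given an observed value of the bottom node."""
    # Marginalise out the intermediate variable x2 for each value of x3.
    unnorm = {x3: sum(joint(x1, x2, x3) for x2 in (0, 1)) for x3 in (0, 1)}
    z = sum(unnorm.values())   # this is the marginal probability p(x1)
    return {x3: value / z for x3, value in unnorm.items()}


post = posterior_x3_given_x1(1)
```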

Approximate Inference.
In most interesting models there is no way to implement exact Bayesian inference. That is, for most nonlinear and/or non-Gaussian models there are no analytic formulae for computing posterior densities. Instead we must resort to approximate inference. There are two basic approaches: (i) sampling methods [21], or (ii) deterministic approximation methods [16]. The most popular deterministic methods are Laplace approximations or variational inference. Generally, deterministic methods are advantageous in being much faster but have the potential disadvantage of producing only locally optimal solutions. As we shall see in Section 5, it has been proposed that cortical brain regions represent information at different levels of abstraction, and that top-down connections instantiate the brain's generative model of the world, and bottom-up processing its algorithm for approximate inference. We now briefly review two different approximate inference methods.

Sampling Methods.
We assume our goal is to produce samples from the multivariate posterior density p(x | y), where y is sensory data, and x are hidden variables of interest, such as activities of neurons in a network. These samples will then provide a representation of the posterior. From this, quantities such as the posterior mean can be computed by simply taking the mean of the samples.
One of the simplest sampling methods is Gibbs sampling [21] which works as follows. We pick a variable $x_i$ and generate a sample from the distribution $p(x_i \mid x_{\setminus i}, y)$, where $x_{\setminus i}$ are all the other variables. We then loop over i, repeat this process a large number of times, and the samples near the end of this process (typically the last half) will be from the desired posterior p(x | y). In general, it may not be possible to easily sample from $p(x_i \mid x_{\setminus i}, y)$. This limits the applicability of the approach, but it is highly efficient for many hierarchical models [21].
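The Gibbs procedure can be illustrated on a target for which the conditionals are known in closed form. The sketch below assumes, purely for illustration, that the posterior is a zero-mean bivariate Gaussian with unit variances and correlation `rho`, in which case each conditional is Gaussian with mean `rho` times the other variable and variance 1 - rho^2. The function name and settings are hypothetical.

```python
import random


def gibbs_bivariate_gaussian(rho, n_samples, seed=0):
    """Gibbs sampling from a zero-mean, unit-variance bivariate Gaussian."""
    rng = random.Random(seed)
    cond_sd = (1.0 - rho ** 2) ** 0.5   # standard deviation of each conditional
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        # Sample each variable in turn from its conditional given the other.
        x1 = rng.gauss(rho * x2, cond_sd)
        x2 = rng.gauss(rho * x1, cond_sd)
        samples.append((x1, x2))
    # Keep only the last half of the chain, as described in the text.
    return samples[n_samples // 2:]


samples = gibbs_bivariate_gaussian(rho=0.8, n_samples=20000)
posterior_mean_x1 = sum(s[0] for s in samples) / len(samples)
```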
A more generic procedure is Metropolis-Hastings (MH), which is a type of Markov Chain Monte Carlo (MCMC) procedure [21]. MH makes use of a proposal density $q(x'; x)$ which is dependent on the current state vector x. For symmetric q (such as a Gaussian) samples from the posterior density can be generated as follows. First, start at a point $x_1$ sampled from the prior, then generate a proposal $x'$ using the density q. This proposal is then accepted with probability min(1, r), where

$$r = \frac{p(y \mid x')\, p(x')}{p(y \mid x_n)\, p(x_n)}. \tag{11}$$

If the step is accepted we set $x_{n+1} = x'$. If it is rejected we set $x_{n+1} = x_n$ (our list of samples can have duplicate entries). This procedure is guaranteed to produce samples from the posterior as long as we run it for long enough, and there are various criteria that can be used to monitor convergence [21]. Equation (11) says we should always accept a new sample if it has higher posterior probability than the last. Because MH also allows occasional transitions to less probable states it can avoid locally optimal solutions. To increase the likelihood of finding globally optimal solutions it is possible to run multiple chains at different temperatures and use a proposal density to switch between them [22]. We will refer to this idea again in Section 4.3.2 where we suggest that the different temperatures may be controlled in the brain via neuromodulation.
These sample-based approaches were used in early neural network models such as the Boltzmann machine and the more recent Deep Belief Networks reviewed in Section 4.4. As we shall see in Section 4.3.2, Gershman et al. [23] have shown how MCMC can be used to account for perceptual multistability.
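The accept/reject rule for a symmetric proposal can be sketched as follows. Working with log probabilities avoids numerical underflow, and only the unnormalised posterior (likelihood times prior) is needed because the normalising constant cancels in the ratio r. The target is an assumption chosen so that the answer is known: a Gaussian prior with mean 0 and precision 1 and a Gaussian likelihood for an observation y = 3 with precision 2, giving a posterior mean of 2 and variance of 1/3. The starting point is passed in as an argument rather than drawn from the prior, and all names are hypothetical.

```python
import math
import random


def metropolis(log_post, x0, step_sd, n_samples, seed=0):
    """Metropolis sampling with a symmetric Gaussian proposal density."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = rng.gauss(x, step_sd)            # draw x' from q(x'; x)
        log_r = log_post(proposal) - log_post(x)    # log of the ratio r
        # Accept with probability min(1, r); otherwise keep the current state.
        if log_r >= 0 or rng.random() < math.exp(log_r):
            x = proposal
        samples.append(x)                           # duplicates are allowed
    return samples


def log_post(x):
    """Unnormalised log posterior: log p(y | x) + log p(x), constants dropped."""
    y = 3.0
    return -0.5 * 1.0 * (x - 0.0) ** 2 - 0.5 * 2.0 * (y - x) ** 2


chain = metropolis(log_post, x0=0.0, step_sd=1.0, n_samples=20000)
kept = chain[len(chain) // 2:]          # discard the first half of the chain
estimate = sum(kept) / len(kept)        # should lie close to 2
```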

Variational Methods.
If our variables comprise sensor data y and unknown hidden variables x then we can define the free energy as

$$F = -\int q(x) \log p(y, x)\, dx + \int q(x) \log q(x)\, dx, \tag{12}$$

where the first term is the average energy, with the average taken with respect to the density q(x), and the second term is the negative entropy of q(x). Given this definition we can write the log marginal likelihood of the data as

$$\log p(y) = -F + \mathrm{KL}\big(q(x) \,\|\, p(x \mid y)\big), \tag{13}$$

where KL( ) is the Kullback-Leibler divergence measure [24]. KL is zero if the densities are equal and is otherwise positive, with larger values reflecting a greater degree of dissimilarity. Given that the term on the left is fixed, we can minimise the KL divergence term by minimising the free energy. This will give us an approximate posterior q(x) that is optimal in the sense of minimising KL divergence with the true posterior.
To obtain a practical learning algorithm we must also ensure that the integrals in (12) are tractable. One generic procedure for attaining this goal is to assume that the approximating density factorizes over groups of variables. In physics, this is known as the mean field approximation. Thus, we consider

$$q(x) = \prod_i q(x_i), \tag{14}$$

where $x_i$ is the ith group of variables. We can also write this as

$$q(x) = q(x_i)\, q(x_{\setminus i}), \tag{15}$$

where $x_{\setminus i}$ denotes all variables not in the ith group. We then define the variational energy for the ith partition as

$$I(x_i) = -\int q(x_{\setminus i}) \log p(y, x)\, dx_{\setminus i} \tag{16}$$

and note that F is minimised when

$$q(x_i) = \frac{\exp[-I(x_i)]}{Z}, \tag{17}$$

where Z is the normalisation factor needed to make $q(x_i)$ a valid probability distribution. This gives us a recipe for approximate inference in which we update the posteriors $q(x_i)$ in turn. This is much like Gibbs sampling, but we update sufficient statistics (e.g., mean and variance) rather than produce samples.
As we described in Section 2.2, point estimates of variables, such as the MAP estimates, can be found by minimising energy. But this does not tell us about the uncertainty in these variables. To find out this uncertainty we can find the distribution q(x) that minimises the free energy. Out of all the distributions which minimise energy, the one that minimises free energy has maximal uncertainty (see (12)). That is, we are minimally committed to specific interpretations of sensory data, in accordance with Jaynes' principle of maximum entropy [24].
Readers can learn more about variational inference in standard tutorials [16,25,26]. We will later refer to variational inference in the context of the Helmholtz machine [27], in Section 4.4, and the free energy principle [12,28] in Section 5.2.
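The mean field recipe of updating each factor in turn can be illustrated on a case where the updates are available in closed form. The sketch below assumes, for illustration only, that the true posterior is a bivariate Gaussian with mean `mu` and precision matrix `lam`. Taking the expectation of the log joint under the other factor then gives a Gaussian factor whose precision is the corresponding diagonal element of `lam` and whose mean depends on the current mean of the other factor. All names and values are hypothetical.

```python
def mean_field_gaussian(mu, lam, n_iters=50):
    """Mean field approximation q(x1) q(x2) to a bivariate Gaussian posterior."""
    mu1, mu2 = mu
    (l11, l12), (_, l22) = lam      # symmetric 2 x 2 precision matrix
    m1, m2 = 0.0, 0.0               # initial means of the two factors
    for _ in range(n_iters):
        # Update sufficient statistics of each factor in turn.
        m1 = mu1 - (l12 / l11) * (m2 - mu2)
        m2 = mu2 - (l12 / l22) * (m1 - mu1)
    # Each factor is Gaussian with variance 1 / (diagonal precision).
    return (m1, 1.0 / l11), (m2, 1.0 / l22)


q1, q2 = mean_field_gaussian(mu=(1.0, -1.0), lam=((2.0, 0.8), (0.8, 1.0)))
```

In this example the factor means converge to the true posterior means, whereas the factor variances (0.5 and 1.0) are smaller than the true marginal variances (about 0.74 and 1.47), a known property of this factorised approximation when the variables are correlated.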

Dynamic Models.
In previous sections we have considered generative models for potentially multiple and multidimensional hidden variables and observations. Going back to the tennis example, I will receive high-dimensional visual observations from which I may wish to infer two hidden variables: the two-dimensional position on court where the ball will land and the position of my opponent.
We now consider models with an explicit dynamic component. A broad class of dynamical models are the discrete time nonlinear state-space models of the form

$$x_n = f(x_{n-1}, u_{n-1}) + w_n, \qquad y_n = g(x_n) + e_n, \tag{18}$$

where $x_n$ are the hidden variables, $y_n$ are the observations, $u_n$ is a control input, $w_n$ is state noise, and $e_n$ is observation noise. All of these quantities are vectors. This is a Nonlinear Dynamical System (NDS) with inputs and hidden variables. The function f( ) is a flow term which specifies the dynamics, and g( ) specifies the mapping from hidden state to observations. The above two equations define the state transition density $p(x_n \mid x_{n-1})$ and the observation density $p(y_n \mid x_n)$ (to simplify the notation we have dropped the dependence on $u_n$, but this is implied). We denote the trajectories or sequences of observations, states, and controls using $Y_n = \{y_1, y_2, \ldots, y_n\}$, $X_n = \{x_1, x_2, \ldots, x_n\}$, and $U_n = \{u_1, u_2, \ldots, u_n\}$. Dynamical models of the above form are important for understanding, for example, Bayesian inference as applied to sensorimotor integration, as described in Section 3.3. In this context, $u_n$ would be a copy of a motor command known as an "efference copy". The dynamical model would then allow an agent to predict the consequences of its actions.
These models can be inverted, that is, we can estimate $x_n$ from $Y_n$ using forward inference. This is depicted in Figure 5 and described mathematically in the following subsection. As we shall see in Section 4, Helmholtz has proposed that perception corresponds to unconscious statistical inference and this has become a working hypothesis for a modern generation of computational neuroscientists. Thus we have labelled inference about $x_n$ as "perception" in Figure 5.

Forward Inference.
The problem of estimating the states given current and previous observations is solved using forwards inference. This produces the marginal densities $p(x_n \mid Y_n)$. The forward inference problem can be solved in two steps. The first step is a Time Update or prediction step

$$p(x_n \mid Y_{n-1}) = \int p(x_n \mid x_{n-1})\, p(x_{n-1} \mid Y_{n-1})\, dx_{n-1}.$$

The second step is a Measurement Update or correction step

$$p(x_n \mid Y_n) = \frac{p(y_n \mid x_n)\, p(x_n \mid Y_{n-1})}{\int p(y_n \mid x_n')\, p(x_n' \mid Y_{n-1})\, dx_n'},$$

which is Bayes rule with prior $p(x_n \mid Y_{n-1})$ from the time update, and likelihood $p(y_n \mid x_n)$. For Linear Dynamical Systems (LDS), where f( ) and g( ) in (18) are linear, forward inference reduces to Kalman Filtering [29]. As we shall see, Beck et al. [30] have shown how Kalman filtering can be implemented using a population of spiking neurons. For Nonlinear Dynamical Systems (NDS), approximate forward inference can be instantiated using an Extended Kalman Filter (EKF). Alternative sample-based forward inference schemes can be implemented using particle filtering. Lee and Mumford have proposed how inference in visual cortical hierarchies can proceed using particle filtering [31].
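For a linear system with Gaussian noise the two steps reduce to updates of a mean and a variance. The sketch below is a scalar Kalman filter under assumptions not stated in the text: the state equation is x_n = a x_{n-1} + w_n, the observation equation is y_n = c x_n + e_n, the noise variances are `q` and `r`, and the control input is omitted. All names are hypothetical.

```python
def kalman_filter(ys, a, c, q, r, m0, v0):
    """Scalar Kalman filter returning filtered means and variances."""
    m, v = m0, v0
    means, variances = [], []
    for y in ys:
        # Time update (prediction): push the previous posterior through the dynamics.
        m_pred = a * m
        v_pred = a * a * v + q
        # Measurement update (correction): Bayes rule with the predicted prior.
        gain = v_pred * c / (c * c * v_pred + r)
        m = m_pred + gain * (y - c * m_pred)
        v = (1.0 - gain * c) * v_pred
        means.append(m)
        variances.append(v)
    return means, variances


# With static dynamics (a = 1, q = 0) one step reproduces Bayes rule for
# Gaussians: prior precision 1, data precision 2, observation y = 3.
means, variances = kalman_filter([3.0], a=1.0, c=1.0, q=0.0, r=0.5, m0=0.0, v0=1.0)
```

The term `y - c * m_pred` is the prediction error, and the gain sets how strongly that error corrects the predicted state.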

Backward Inference.
As we shall see, backward inference is important for postdiction (predictions about the past; see section on visual processing) and for planning and control (see Section 6). We define the posterior probability of state $x_n$ given all observations up to time point N as

$$\gamma(x_n) = p(x_n \mid Y_N).$$

This can be computed recursively using

$$\gamma(x_n) = \int p(x_n \mid x_{n+1}, Y_n)\, \gamma(x_{n+1})\, dx_{n+1}.$$

The first term in the integral can be thought of as a reverse flow term and is computed using Bayes rule

$$p(x_n \mid x_{n+1}, Y_n) = \frac{p(x_{n+1} \mid x_n)\, p(x_n \mid Y_n)}{\int p(x_{n+1} \mid x_n')\, p(x_n' \mid Y_n)\, dx_n'}.$$

Importantly, this form of backward inference (the so-called gamma recursions) can be implemented without requiring storage of the observations $y_n$. These gamma recursions can therefore be implemented online, which is important for a potential neuronal implementation. Backward inference is represented graphically in Figure 6. Similar backwards recursions can be derived to estimate the control signals $p(u_n \mid x_1, Y_n)$ given initial state values $x_1$ and desired sensory observations. This is depicted in Figure 7 and is important for planning and control as we discuss in Section 6. We envisage that backwards inference operates over short time scales for perception (tens of ms) and much longer time scales for planning and cognition.
Readers can find out more about forwards and backward inference for dynamical models in standard textbooks [16]. It is also worth noting that here we are referring to forward and backward recursions in time. This should not be confused with forward and backward message passing in hierarchical models as described in Section 5.

Figure 6: Perception as forward and backwards inference over states. Perception here corresponds to estimation of hidden state density $p(x_n \mid U_N, Y_N)$ given known motor efference copy $U_N$ and sensory input $Y_N$. Here, forward estimates about previous states $x_n$ (i.e., from forward inference) can be improved upon using more recent efference copy $u_{n+1}, \ldots, u_N$ and sensory information $y_{n+1}, \ldots, y_N$. These so-called postdictive estimates may be useful in, for example, visual perception.
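The backward recursion can be sketched for a hidden state that takes a small number of discrete values, so that integrals become sums. This is a swapped-in simplification of the continuous-state case in the text. Note that the backward pass uses only the filtered densities and the transition matrix, not the observations themselves. The transition probabilities and likelihood values are hypothetical.

```python
def forward_filter(prior, trans, likelihoods):
    """Filtered densities p(x_n | Y_n); trans[i][j] = p(x_{n+1} = j | x_n = i)."""
    k = len(prior)
    pred = list(prior)
    filtered = []
    for lik in likelihoods:
        # Measurement update: multiply by the likelihood and normalise.
        post = [lik[i] * pred[i] for i in range(k)]
        z = sum(post)
        post = [p / z for p in post]
        filtered.append(post)
        # Time update: predict the next state from the current posterior.
        pred = [sum(trans[i][j] * post[i] for i in range(k)) for j in range(k)]
    return filtered


def backward_smooth(filtered, trans):
    """Smoothed densities p(x_n | Y_N) computed from the filtered densities."""
    k = len(filtered[0])
    smoothed = [None] * len(filtered)
    smoothed[-1] = filtered[-1]
    for n in range(len(filtered) - 2, -1, -1):
        alpha = filtered[n]
        pred = [sum(trans[i][j] * alpha[i] for i in range(k)) for j in range(k)]
        # Reverse flow p(x_n = i | x_{n+1} = j, Y_n), weighted by the later belief.
        smoothed[n] = [
            sum(trans[i][j] * alpha[i] / pred[j] * smoothed[n + 1][j] for j in range(k))
            for i in range(k)
        ]
    return smoothed


trans = [[0.9, 0.1], [0.1, 0.9]]
likelihoods = [[0.5, 0.5], [0.1, 0.9]]   # first observation is uninformative
filtered = forward_filter([0.5, 0.5], trans, likelihoods)
smoothed = backward_smooth(filtered, trans)
```

In this example the filtered belief at the first time point is flat, but the smoothed belief favours the second state because of the later observation, which is the postdictive effect described above.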

Parameter Estimation.
Dynamical systems models also depend on unknown parameters, θ. These will parameterise the dynamical function f( ) and the observation function g( ). These parameters can be estimated using variational methods, for example, for LDS [32] or NDS [33] or using sampling methods [34,35]. As we shall see in Section 4, learning in computational models of the brain can be formulated as parameter estimation in Bayesian models.

Behavioural Models
An attractive feature of Bayesian models of behaviour is that they provide descriptions of what would be optimal for a given task. They are often referred to as "ideal observer" models because they quantify how much to update our beliefs in light of new evidence. Departures from these "normative models" can then be explained in terms of other constraints such as computational complexity or individual differences. One way to address individual differences is to use an Empirical Bayesian approach in which parameters of priors and their parametric forms are estimated from data. See [36] for an example of this approach in modelling visual motion processing.
What follows in this section is a review of Bayesian models of sensory integration, visual processing, sensorimotor integration, and collective decision making. As we shall see, the priors that we have about, for example, our visual world most readily show themselves in situations of stimulus ambiguity or at low signal-to-noise ratios. Much of the phenomenology of these perceptual illusions is long established [37], but Bayesian modelling provides new quantitative explanations and predictions. A more introductory review of much of this material is available in Frith's outstanding book on mind and brain [3].

Sensory Integration.
Ernst and Banks [38] considered the problem of integrating information from visual and tactile (haptic) modalities. If vision v and touch t information are independent given an object x then Bayesian fusion of sensory information produces a posterior density

$$p(x \mid v, t) = \frac{p(v \mid x)\, p(t \mid x)\, p(x)}{p(v, t)}.$$

For a uniform prior p(x) and for Gaussian likelihoods, the posterior will also be a Gaussian with precision $\lambda_{vt}$. From Bayes rule for Gaussians (5) we know that precisions add

$$\lambda_{vt} = \lambda_v + \lambda_t,$$

where $\lambda_v$ and $\lambda_t$ are the precisions of visual and haptic senses alone, and the posterior mean is a relative-precision weighted combination

$$m_{vt} = \frac{\lambda_v}{\lambda_{vt}} m_v + \frac{\lambda_t}{\lambda_{vt}} m_t$$

or

$$m_{vt} = w_v m_v + w_t m_t$$

with weights $w_v$ and $w_t$, where $m_v$ and $m_t$ denote the estimates from vision and touch alone.
Ernst and Banks [38] asked subjects which of two sequentially presented blocks was the taller. Subjects used either vision alone, touch alone, or a combination of the two. They recorded the accuracy with which discrimination could be made and plotted this as a function of difference in block height. This was repeated for each modality alone and then both together. They also used various levels of noise on the visual images. From the single modality discrimination curves they then fitted cumulative Gaussian density functions, which provided estimates of the precisions $\lambda_t$ and $\lambda_v(i)$ where i indexes visual noise levels. In the dual modality experiment the weighting of visual information predicted by Bayes' rule for the ith level of visual noise is

$$w_v(i) = \frac{\lambda_v(i)}{\lambda_v(i) + \lambda_t}.$$

This was found to match well with the empirically observed weighting of visual information. They observed visual capture at low levels of visual noise and haptic capture at high levels. Inference in this simple Bayesian model is consistent with standard signal detection theory [39]; however, Bayesian inference is more general as it can accommodate, for example, nonuniform priors over block height.
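The predicted visual weight can be computed directly from the single-modality precisions. In the sketch below the precision values are hypothetical and are not the estimates reported by Ernst and Banks [38]; they are chosen only to show the predicted shift from visual capture to haptic capture as visual noise increases.

```python
def visual_weight(lam_v, lam_t):
    """Weight on the visual estimate predicted by Bayes rule for Gaussians."""
    return lam_v / (lam_v + lam_t)


def fuse(m_v, lam_v, m_t, lam_t):
    """Fused height estimate and its precision under a uniform prior."""
    w_v = visual_weight(lam_v, lam_t)
    w_t = 1.0 - w_v
    return w_v * m_v + w_t * m_t, lam_v + lam_t


# Hypothetical precisions: haptic precision fixed, visual precision falling
# as the level of visual noise increases.
lam_t = 1.0
lam_v_by_noise_level = [16.0, 4.0, 1.0, 0.25]
weights = [visual_weight(lam_v, lam_t) for lam_v in lam_v_by_noise_level]
```

A weight near 1 corresponds to visual capture and a weight near 0 to haptic capture.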
There have been numerous studies of the potential role of Bayesian inference for integration of other senses. For example, object localisation using visual and auditory cues in the horizontal [40] and depth [41] planes has supported a Bayesian integration model with vision dominating audition in most ecologically valid contexts. This visual capture is the basis of the "ventriloquism" effect, but is rapidly degraded with visual noise. This literature has considered only simple inferences about single variables such as block height or spatial location. Nevertheless these studies have demonstrated a fundamental concept: that sensory integration is near Bayes-optimal.

ISRN Biomathematics

3.2. Visual Processing.
Kersten et al. [42] review the problem of visual object perception and argue that much of the ambiguity in visual processing, for example concerning occluded objects, can be resolved with prior knowledge. This idea is naturally embodied in a Bayesian framework [43] and has its origins in the work of Helmholtz who viewed perception as "unconscious inference." An example is how the inference of shape from shading is informed by a "light-from-above" prior. This results in circular patches which are darker at the bottom being perceived as convex. The adaptability of this prior, and subsequent perceptual experience, has been demonstrated by Adams et al. [44].
An example of such a Bayesian modelling approach is the work of Yu et al. [45] who propose a normative model for the Eriksen Flanker task. This simple decision making task was designed to probe neural and behavioural responses in the context of conflicting information. On each trial, three visual stimuli are presented and subjects are required to press a button depending on the identity of the central stimulus. The flanking stimuli are either congruent or incongruent. Yu et al. proposed a discrete time ideal observer model that qualitatively captured the dynamics of the decision making process. This used the recursive form of Bayes rule in (3). In later work, a continuum time limit of this model was derived [46]. This produced semianalytic predictions of reaction time and error rate which provided accurate numerical fits to subject behaviour. They also proposed an algorithm for how these models could be approximately implemented in a neural network [45], which we will refer to later (see Section 5).
Weiss et al. [47] propose that many motion illusions arise from the result of Bayes-optimal processing of ecologically invalid stimuli. Their model was able to reproduce a number of psychophysical effects based on the simple assumptions that measurements are noisy and the visual system has a prior which expects slower movements to be more likely than faster ones. For example, the model could predict the direction of global motion of simple objects such as rhomboids, as a function of contrast and object shape. This model was later refined [36] by showing the prior to be non-Gaussian and subject specific, and that measurement noise variance was inversely proportional to visual contrast.
Najemnik and Geisler developed an ideal Bayesian observer model of visual search for a known target embedded in a natural texture [48]. Prior beliefs in target location were updated to posterior beliefs using a likelihood term that reflected the foveated mapping properties of visual cortex. When this likelihood was matched to individual subjects' discrimination ability, the resulting visual searches were nearly optimal in terms of the median number of saccades. Later work [49] showed that fixation statistics were also similar to the ideal observer.
If the world we perceive is the result of hierarchical processing in cortical networks then, because this processing may take some time (of the order of 100 ms), what is perceived to be the present could actually be the past. As this would obviously be disadvantageous for the species, it has been argued that our perceptions are based on predictive models. A 50 ms delay in processing could be accommodated by estimating the state of the world 50 ms in the future. There is much experimental evidence for this view [50]. However, a purely "predictive" account fails to accommodate recent findings in visual psychophysics. The flash-lag effect, for example, is a robust visual illusion whereby a flash and a moving object that are located in the same position are perceived to be displaced from one another. If the object stops moving at the time of the flash, no such displacement is perceived. This indicates that the position of the object after the flash affects our perception of where the flash occurred. This "postdictive" account explains the phenomenon [9], and related data where the object reverses its direction at the flash time. A simple Bayesian model has been proposed to account for the activity of V4 neurons in this task [51]. Later experimental work found evidence for a linear combination of both predictive and postdictive mechanisms [52].
Related phenomena include backward masking [53] and the colour-phi illusion [54]. Here, two coloured dots are presented one followed quickly by the other and in close spatial proximity. This gives rise to a perception of movement and of the colour changing in the middle of the apparent trajectory. Because the viewer cannot know the colour of the second dot until it appears, the percept attributed to the time of the trajectory must be formed in retrospect. This postdictive account motivated Dennett [55] to propose his multiple drafts theory of consciousness. However, these phenomena are perhaps more simply explained by forwards and backwards inference in dynamic Bayesian networks (see Figure 6 and Section 2.4).

Sensorimotor Integration.
Wolpert et al. [56] have examined the use of dynamic Bayesian models, also referred to as forward models, for sensorimotor integration. These models are given generically by (18), where $x_n$ is the current state, $u_n$ is a copy of a motor command, $y_n$ are sensory observations, and $w_n$ and $e_n$ are state and observation noise.
Inference in these models proceeds as described in Section 2.4.1. First, the dynamical equation describing state transitions is integrated to create an estimate of the next state. This requires as input a copy of the current motor command (so-called efference copy) and the current state. In terms of Bayesian updates in dynamical models, this corresponds to the time update or prediction step. A prediction of sensory input can then be made based on the predicted next state and the mapping from $x_n$ to $y_n$. Finally, a measurement update or correction step can be applied which updates the state estimate based on current sensory input.
Wolpert et al. cite a number of key features of dynamic Bayesian models including the following. First, they allow outcomes of actions to be predicted and acted upon before sensory feedback is available. This may be important for rapid movements. Second, they use efference copy to cancel the sensory effects of movement ("reafference"); for example, the visual world appears stable despite eye movements. Third, simulation of actions allows for mental rehearsal which can potentially lead to improvements in movement accuracy.
This framework was applied to the estimation of arm position using proprioceptive feedback and a forward model based on a linear dynamical system [56]. Inference in this model was then implemented using a Kalman filter. The resulting bias and variance in estimates of arm position were shown to closely correspond to human performance, with proprioceptive input becoming more useful later on in the movement when predictions from the forward model were less accurate.
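The prediction and correction steps described above can be sketched for a scalar linear forward model. This is a minimal illustration rather than the model used in [56]: the dynamics x[n+1] = a*x[n] + b*u[n] + w, the observation y = x + e, and the parameter names `a`, `b`, `q`, and `r` are all assumptions made for the example.

```python
def kalman_step(mean, var, u, y, a=1.0, b=1.0, q=0.1, r=0.5):
    """One predict/correct cycle for x[n+1] = a*x[n] + b*u[n] + w, y = x + e.

    mean, var : current posterior mean and variance of the state
    u         : efference copy of the motor command
    y         : sensory observation of the next state
    q, r      : variances of the state noise w and observation noise e
    """
    # Time update (prediction): push the belief through the forward model.
    pred_mean = a * mean + b * u
    pred_var = a * a * var + q
    # Measurement update (correction): weight the sensory prediction error.
    gain = pred_var / (pred_var + r)
    post_mean = pred_mean + gain * (y - pred_mean)
    post_var = (1.0 - gain) * pred_var
    return pred_mean, pred_var, post_mean, post_var


if __name__ == "__main__":
    print(kalman_step(mean=0.0, var=1.0, u=1.0, y=2.0, q=0.0, r=1.0))
```

In this sketch the gain grows as the prediction variance grows relative to the observation noise, so sensory feedback is weighted more heavily whenever the forward model's predictions are less reliable.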
One of the core ideas behind these forward models is that, during (perceptual) inference, the sensory consequences of a movement are anticipated and used to attenuate the percepts related to these sensations. This mechanism reduces the predictable component of sensory input to self-generated stimuli, thereby enhancing the salience of sensations that have an external cause. This has many intriguing consequences. For example, it predicts that self-generated forces will be perceived as weaker than externally generated forces. This prediction was confirmed in a later experiment [57], thereby providing a neuroscientific explanation for force escalation during conflict; children trading tit-for-tat blows will often each assert that the other hit harder.
Körding and Wolpert [58] have investigated learning in the sensorimotor system using a visual reaching task in which subjects moved their finger to a target and received visual feedback. This feedback provided information about target position that had an experimentally controlled bias and variance. Subjects were found to be able to learn this mapping (from vision to location) and integrate it into their behaviour, in a Bayes-optimal way.
Returning to our tennis theme, an analysis of three years of Wimbledon games has indicated that the outcome of the current point depends on the outcome of the previous point [59]. There are multiple potential sources of correlation here. It could be that a player intermittently enjoys a "sweet spot" in the parameters of his internal sensorimotor model, so that it accurately predicts body and ball position and he is able to hit the ball cleanly. Alternatively, a player may find a new pattern in his opponent's behaviour, such as body position or previous serve, that predicts the current service direction.

Collective Decision Making.
Sorkin et al. [60] have applied Bayes rule for Gaussians (see (5)) in their study of collective decision making. Here the optimal integration procedure involves each group member's input to the collective decision being weighted in proportion to that member's competence at the task. Mathematically, "competence" corresponds to precision. This model of group behaviour was shown to be better than a different model which assumed members made individual decisions which were then combined into a majority vote. This latter model better described collective decision making when members did not interact.
Bahrami et al. [61] investigated pairs of subjects (dyads) making collective perceptual decisions. Dyads with similarly sensitive subjects (similar precisions) were found to produce collective decisions that were close to optimal, but this was not the case for dyads with very different sensitivities. These observations were explained by a Bayes-optimal model under the assumption that subjects accurately communicated their confidence. This confidence sharing proved essential for the group decision to be better than the decision of the best subject.
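Precision-weighted integration of group members' estimates can be sketched as follows. This is a generic illustration of Bayes rule for Gaussians with a flat prior, not the specific analysis in [60] or [61]; the function name is hypothetical.

```python
def combine_estimates(means, precisions):
    """Combine independent Gaussian estimates by precision weighting.

    Each member reports an estimate (mean) and a competence (precision).
    Returns the mean and precision of the collective estimate.
    """
    total_precision = sum(precisions)
    if total_precision <= 0:
        raise ValueError("total precision must be positive")
    weighted_sum = sum(p * m for p, m in zip(precisions, means))
    return weighted_sum / total_precision, total_precision


if __name__ == "__main__":
    # Two members: the second is three times as precise as the first.
    print(combine_estimates([0.0, 10.0], [1.0, 3.0]))
```

The collective precision is the sum of the individual precisions, so under these assumptions the group estimate is at least as precise as that of its best member.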

Brain Models
We now turn to Bayesian models of the brain. As articulated by Colombo and Series [62], it could be that our behaviour is near Bayes-optimal yet the neural mechanisms underlying it are not. Current opinion on this issue is divided. According to Rust and Stocker [63] "If the system as a whole performs Bayesian inference, it seems unlikely that any one stage in this cascade represents a single component of the Bayesian model (e.g., the prior) or performs one of the mathematical operations in isolation (multiplying the prior and the likelihood)." However, the above statement may be too heavily influenced by the simplicity of the tasks which were initially used to demonstrate near Bayes-optimal behaviour, for example, univariate cue integration. As we shall see, the nonlinear dynamic hierarchical models underlying predictive coding models of cortical macrocircuits (Section 5) do in fact provide a close correspondence with biology [1,19,64].
The structure and function of the human brain can be studied at multiple temporal and spatial scales. Research activity at the different scales effectively constitutes different scientific disciplines, although there is a good deal of work addressing integrative and unifying perspectives [2,65,66]. Our review of the literature proceeds through increasing spatial scale and a later section reviews work in modelling cortical macrocircuits.

Synapses and Dendrites.
Most models of information processing in neural circuits require that synaptic efficacies are stable at least over seconds if not minutes or hours. However, real synapses can change strength several-fold at the time scale of a single interspike interval. This is known as Short Term Synaptic Plasticity (STP) [67]. Why do synapses change so quickly?
Pfister et al. [68] argue that neuronal membrane potentials are the primary locus of computational activity, where incoming information from thousands of presynaptic cells is integrated and analog state values, x, are computed. It is then proposed that the goal of synaptic computation is to optimally reconstruct presynaptic membrane potentials, and optimal reconstructions are made possible via STP. Crudely, if a synapse has recently received a spike it increases its estimate of x and decreases it otherwise. Simple dynamic Bayesian models of this process explain empirical synaptic facilitation and depression.
Kiebel and Friston [69] propose that, through selective dendritic filtering, single neurons respond to specific sequences of presynaptic inputs. This study employs a dynamic Bayesian model of dendritic activity in which intracellular dendritic states are also viewed as predicting their presynaptic inputs. Pruning of dendritic spines then emerges as a consequence of parameter estimation in this model.

Neurons.
Gold and Shadlen [15] propose that categorical decisions about sensory stimuli are based on the accumulation of information over time in the form of a log likelihood ratio (see Section 2). They review experiments in which monkeys were trained to make saccades to a target depending on the perceived direction of moving dots in the centre of a screen. Firing rates of neurons in superior colliculus and lateral intraparietal regions were seen to follow this evidence accumulation model. In follow-up experiments targets appeared on the left or right with different prior probability and initial firing rates followed these priors as predicted by the accumulation model. These models are also known as drift diffusion models and are the continuous analog of the sequential likelihood ratio test [14].
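The accumulation of a log likelihood ratio towards a decision bound can be sketched in discrete time. The Gaussian observation model (means +mu and -mu with common variance), the bound, and all names below are assumptions made for illustration; they are not taken from [15].

```python
import math


def accumulate_evidence(samples, mu=1.0, sigma2=1.0, prior_right=0.5, bound=5.0):
    """Sequentially sum log likelihood ratios until a bound is reached.

    Hypotheses (assumed for illustration): observations are Gaussian with
    mean +mu ('right') or -mu ('left') and common variance sigma2.
    Returns (choice, samples_used, final_log_odds); choice is None if the
    bound is never reached.
    """
    # The starting point is the prior log odds, so unequal priors shift the
    # initial value of the decision variable.
    log_odds = math.log(prior_right / (1.0 - prior_right))
    for n, y in enumerate(samples, start=1):
        # log N(y; +mu, sigma2) - log N(y; -mu, sigma2) = 2 * mu * y / sigma2
        log_odds += 2.0 * mu * y / sigma2
        if log_odds >= bound:
            return "right", n, log_odds
        if log_odds <= -bound:
            return "left", n, log_odds
    return None, len(samples), log_odds


if __name__ == "__main__":
    print(accumulate_evidence([1.0, 1.0, 1.0]))
    print(accumulate_evidence([1.0, 1.0, 1.0], prior_right=0.8))
```

In this sketch a prior favouring one alternative raises the starting point of the decision variable, so fewer samples are needed to reach the corresponding bound.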
Fiorillo [70] proposed a general theory of neural computation based on prediction by single neurons. Each neuron is proposed to mirror the function of the whole system in learning to predict aspects of the world related to future reward. A neuron receives prior temporal information via nonsynaptic voltage-gated channels, and prior spatial information from a subset of its synaptic inputs. The remaining excitatory synaptic inputs provide current information about the state of the world. This would correspond to a "likelihood" term. The difference between expected and actual state is reflected as a prediction error signal encoded in the membrane potential of the cell. This proposal seems consistent with predictive coding theories that are formulated at a systems level (see Section 5).
Lengyel et al. [71] model storage and recall in an autoassociative model of hippocampal area CA3. The model treats recall as a problem of optimal probabilistic inference. Information is stored in the phase of cell firing relative to the hippocampal theta rhythm, a so-called spike-time code or phase code. Learning of these phase codes is based on Spike Timing Dependent Plasticity (STDP), such that a synapse is strengthened if the cell fires shortly after receiving a spike on that synapse. If the order of events is reversed the synapse is weakened. Synaptic changes only occur in a small time window, as described by an STDP curve. Given empirical STDP curves the Lengyel et al. model was able to predict the form of empirical Phase Response Curves (PRCs) underlying recall dynamics. These PRCs describe the synchronization properties of neurons. A refinement of their model [72] represented information in both spike timing and rate, and an approximate inference algorithm was developed using variational inference (see Section 2.3.2).
Deneve [20] shows that neurons that optimally integrate evidence about events in the world exhibit properties similar to integrate-and-fire neurons with spike-dependent adaptation (a gradually reducing firing rate). She proposes that neurons code for time-varying hidden variables, such as direction of motion, that the basic meaning of a spike is the occurrence of new information, and that propagation of spikes corresponds to Bayesian belief propagation (see Section 2). A companion paper [73] shows how neurons can learn to recognize dynamical patterns, and that successive layers of neurons can learn hierarchical models of sensory input. The learning that emerges is a form of STDP.

Probabilistic Codes.
The response of a cortical neuron to sensory input is highly variable over trials, with cells showing Poisson-like distributions of firing rates. Specifically, firing rate variances grow in proportion to mean firing rates, as would be expected from a Poisson density [74]. Hoyer and Hyvarinen [75] review in vitro experiments which suggest that the variability of neuronal responses may not be a property of neurons themselves but rather emerges in intact neural circuits. This neural response variability may be a way in which neural circuits represent uncertainty.
Ma et al. [76] argue that if cells fired in the same way on every trial the brain would know exactly what the stimulus was. They suggest that the variability over a population of neurons for a single trial provides a way in which this uncertainty could be encoded in the brain, thus providing a substrate for Bayesian inference. Moreover, if the distribution of cell activities is approximately Poisson then Bayesian inference for optimal cue integration, for example, can be implemented with simple linear combinations of neural activity. They call this representation a Probabilistic Population Code (PPC). An interesting property of these codes is that sharply peaked distributions are encoded with higher firing rates (see Figure 1 in [77]). If the distribution was Gaussian this would correspond to high precision.
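The claim that cue integration reduces to a linear combination of activity can be illustrated with a small sketch. We assume, for illustration only, independent Poisson neurons with Gaussian tuning curves of common width that tile the stimulus range densely, and a flat prior; under these assumptions the log posterior over the stimulus is quadratic, so it is summarised by a mean and a precision. The function names are hypothetical and the sketch is not the implementation in [76].

```python
def decode_ppc(counts, centres, width=1.0):
    """Posterior mean and precision over a stimulus from population spike counts.

    Assumptions: independent Poisson neurons, Gaussian tuning curves with
    preferred stimuli `centres` and common `width`, dense tiling (so the sum
    of tuning curves is constant), and a flat prior.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no spikes: the posterior is flat")
    mean = sum(r * c for r, c in zip(counts, centres)) / total
    # More spikes (higher firing rates) give a more sharply peaked posterior.
    precision = total / width ** 2
    return mean, precision


def integrate_cues(counts_a, counts_b):
    """Cue integration as a linear operation: add the two activity patterns."""
    return [ra + rb for ra, rb in zip(counts_a, counts_b)]


if __name__ == "__main__":
    centres = [-2.0, -1.0, 0.0, 1.0, 2.0]
    cue_a = [0, 2, 0, 0, 0]
    cue_b = [0, 0, 0, 4, 0]
    print(decode_ppc(cue_a, centres), decode_ppc(cue_b, centres))
    print(decode_ppc(integrate_cues(cue_a, cue_b), centres))
```

Decoding the summed activity gives the same answer as combining the two single-cue posteriors by precision weighting, which is the sense in which addition of population activity implements optimal integration in this simplified setting.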
Ma et al. [76] concede that a deficiency of their PPC scheme is that neural activities are likely to saturate when sequential inferences are required. This can be avoided by using a nonlinearity to keep neurons within their dynamical range, which could be implemented, for example, using divisive normalisation [78]. This idea was taken up in later work [30] which shows how populations of cells can use PPCs to implement Kalman filtering.

Sampling Codes.
A different interpretation of neural response variability is that populations of cells are implementing Bayesian inference by sampling from a posterior density [75] (see Section 2.3.1). Hoyer and Hyvarinen [75] suggest that "variability over time" could be used whereby a "single neuron could represent a continuous distribution if its firing rate fluctuated in accordance with the distribution to be represented. At each instant in time, the instantaneous firing rate would be a random sample from the distribution to be represented." This interpretation is reviewed in [5,6] and contrasted with PPCs.
This sampling perspective provides an account of bistable perception in which multiple interpretations of ambiguous input correspond to sampling from different modes of the posterior. This may occur during bistable percepts arising from, for example, binocular rivalry or the Necker cube illusion. If stimuli are modified such that one interpretation is more natural, then it becomes dominant for longer time periods. This is consistent with Bayesian sampling where more samples are taken from dominant modes [21]. The above idea was investigated empirically by placing Necker cubes against backgrounds composed of unambiguous cubes [79]. Subjects experienced modified dominance times in line with the above predictions. In experiments on binocular rivalry, where images presented to the two eyes are different, only one of them will be perceived at a given time. A switch will then occur and the other image will be perceived. For certain stimuli, subjects tend to perceive a switch as a wave propagating across the visual field. This behaviour can be readily explained by Bayesian sampling in a Markov random field model [23].
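The link between posterior mass and dominance time can be illustrated with a toy sampler. The bimodal density, the random-walk Metropolis scheme, and all names below are assumptions chosen for illustration; they are not the models used in [21], [23], or [79].

```python
import math
import random


def posterior_density(x, w_dominant=0.8):
    """Unnormalised bimodal posterior: unit-variance Gaussian modes at +2 and -2."""
    return (w_dominant * math.exp(-0.5 * (x - 2.0) ** 2)
            + (1.0 - w_dominant) * math.exp(-0.5 * (x + 2.0) ** 2))


def metropolis(n_samples, w_dominant=0.8, step=1.5, seed=0):
    """Random-walk Metropolis sampler; returns the list of samples."""
    rng = random.Random(seed)
    x = 0.0
    p_x = posterior_density(x, w_dominant)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        p_new = posterior_density(proposal, w_dominant)
        # Accept with probability min(1, p_new / p_x).
        if rng.random() * p_x < p_new:
            x, p_x = proposal, p_new
        samples.append(x)
    return samples


def dominance_fraction(samples):
    """Fraction of time spent near the mode at +2 (one 'percept')."""
    return sum(1 for s in samples if s > 0.0) / len(samples)


if __name__ == "__main__":
    for w in (0.5, 0.8):
        print(w, dominance_fraction(metropolis(20000, w_dominant=w)))
```

Raising the weight of one mode, which here stands in for making one interpretation more natural, increases the fraction of samples drawn from it.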
It should be borne in mind that other proposals have been made regarding the nature of bistable perception. For example, Dayan [80] has proposed a deterministic generative and recognition model for binocular rivalry with an emphasis on competition between top-down hypotheses rather than bottom-up stimulus information. Here, switching between percepts was implemented with a simple fatigue process in which stable states slowly become unstable, resulting in perceptual oscillation.
From a computational perspective, the idea that populations of cells may be sampling from posterior densities is an attractive one. The sampling approach has become a standard method for inverting Bayesian models in statistics and engineering [21]. It is best suited, however, to low-dimensional problems, because the algorithms become very slow in high dimensions. It is popular in statistics and engineering because it is much more likely than deterministic methods to produce globally optimal posteriors. One method for encouraging this is to have a "temperature" parameter which starts off high and is gradually reduced over time, according to an annealing schedule. Annealed Importance Sampling, for example, is a gold standard method for approximating the model evidence [26]. Sampling approaches have been used in neural network models from the Boltzmann machine, to sparse hierarchical models and Deep Belief Networks (see Section 4.4).
In models with Gaussian observations the temperature corresponds to the precision of the data. As we shall see later, precisions have been proposed to be at least partly under the control of neuromodulators, so it seems reasonable to suggest that sampling based inference may be guided towards global optima via neuromodulation.

Spontaneous Activity.
If neuronal populations encode Bayesian models of sensory data then this predicts a particular relationship between spontaneous and evoked neural activity. This has been investigated empirically by Berkes et al. [81]. If stimulus y is caused by event x then a Bayesian model will need to represent the prior distribution over the cause, p(x), and update it to the posterior distribution p(x | y). If this procedure is working properly then the average posterior (evoked) activity should be approximately equal to the prior activity. That is,
$$ p(x) \approx \frac{1}{N} \sum_{i=1}^{N} p(x \mid y_i), $$
where $y_i$, $i = 1, \ldots, N$, are samples from the environment. Here the left hand side is the prior and the right hand side is average-evoked activity. This prediction was later confirmed by research from the same team who analysed visual cortical activity of awake ferrets during development [81]. The similarity between spontaneous and average-evoked activities, as measured using KL-divergence (see Section 2), increased with age and was specific to responses evoked by natural scenes. Fiser et al. [6] argue that the above relationship between spontaneous and average-evoked activity fits more naturally with a sampling view of neural coding.
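The relationship between the prior and the average posterior can be checked numerically in a small discrete example. The two-state cause, the likelihood table, and the function names below are hypothetical; the sketch only illustrates why the match holds when stimuli follow the statistics the model expects, and why it fails otherwise.

```python
import math


def posterior(prior, likelihood_y):
    """Bayes rule for a discrete cause x given one stimulus value y."""
    joint = [p * l for p, l in zip(prior, likelihood_y)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


def average_posterior(prior, likelihood, stimulus_probs=None):
    """Average p(x | y) over stimuli y.

    likelihood[j][i] = p(y = j | x = i). If stimulus_probs is None, stimuli
    are weighted by the model's own marginal p(y); otherwise by the supplied
    environmental frequencies.
    """
    avg = [0.0] * len(prior)
    for j, likelihood_y in enumerate(likelihood):
        post, p_y = posterior(prior, likelihood_y)
        weight = p_y if stimulus_probs is None else stimulus_probs[j]
        avg = [a + weight * q for a, q in zip(avg, post)]
    return avg


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    prior = [0.7, 0.3]
    likelihood = [[0.9, 0.2], [0.1, 0.8]]
    matched = average_posterior(prior, likelihood)
    mismatched = average_posterior(prior, likelihood, stimulus_probs=[0.5, 0.5])
    print(kl_divergence(prior, matched), kl_divergence(prior, mismatched))
```

When the stimuli are drawn from the distribution the model expects, the divergence between prior and average posterior is zero up to rounding; when the stimulus statistics differ, it is strictly positive.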

4.4. Generative Models.
This section describes macroscopic models of cortical processing either of single brain regions or of processing in hierarchical models [2,82]. The work reviewed in this section is very closely related to that described in Section 5, the main difference being that Section 5 proposes a specific mapping onto cortical anatomy based on predictions, prediction errors, and the laminar structure of cortex.
Early models of hierarchical processing in cortex focus on feedforward processing. This transforms sensory input by static spatiotemporal filtering into more abstract representations and produces object representations that are translationally and viewpoint invariant as shown, for example, by Fukushima [83], Riesenhuber and Poggio [84], and Stringer and Rolls [85].
An alternative view on cortical processing is the idea of analysis-by-synthesis which suggests the cortex has a generative model of the world and that recognition involves inversion of this model [86]. This very general idea has also become known as predictive coding.
This idea is combined with Helmholtz's concept of perception as inference in the Helmholtz machine [27]. This is an unsupervised learning approach in which a recognition model infers a probability distribution over underlying causes of sensory input, and a separate generative model is used to train the recognition model. The approach assumes causes and inputs are binary variables. Both recognition and generative models are updated so as to minimise a variational free energy bound on the log model evidence. This implicitly minimises the Kullback-Leibler divergence between the true posterior density over causes and the approximate posterior instantiated in the recognition model (see Section 2.3.2).
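The relationship between the free energy bound and the Kullback-Leibler divergence can be verified for a single binary cause. This is a sketch of the bound itself, not of the Helmholtz machine's learning algorithm; the sign convention (free energy as an upper bound on the negative log evidence) and the function names are assumptions made here.

```python
import math


def neg_log_evidence(prior, likelihood_y):
    """-log p(y) for a discrete cause x."""
    return -math.log(sum(p * l for p, l in zip(prior, likelihood_y)))


def free_energy(q, prior, likelihood_y):
    """Variational free energy F(q) = sum_x q(x) log[q(x) / p(y, x)].

    F is an upper bound on -log p(y); the gap is KL(q || p(x | y)).
    """
    total = 0.0
    for qx, px, ly in zip(q, prior, likelihood_y):
        if qx > 0:
            total += qx * math.log(qx / (px * ly))
    return total


if __name__ == "__main__":
    prior = [0.7, 0.3]           # p(x) for a binary cause
    likelihood_y = [0.9, 0.2]    # p(y | x) for the observed y
    evidence = sum(p * l for p, l in zip(prior, likelihood_y))
    true_post = [p * l / evidence for p, l in zip(prior, likelihood_y)]
    print(neg_log_evidence(prior, likelihood_y))
    print(free_energy([0.5, 0.5], prior, likelihood_y))
    print(free_energy(true_post, prior, likelihood_y))
```

The bound is tight only when the approximate posterior equals the true posterior, which is why minimising free energy with respect to the recognition density implicitly minimises the divergence between the two.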
Olshausen and Field [87] have proposed a sparse coding model of natural images where the likelihood is a simple linear model relating a "code" to image data, but the prior over code elements factorises and there is a sparse prior over each element. For a given image, most code elements are therefore small with a few being particularly large. This approach was applied to images of natural scenes and resulted in a bank of feature detectors that were spatially localised, oriented, and spanned a number of spatial scales, much like the simple cells in V1. A similar sparse coding approach can explain the properties of auditory nerve cells [88]. Later work [89] developed a two-layer model in which cells in the first layer were topographically organised and cells in the second layer were adapted so as to maximise the sparseness of locally pooled energies. Learning in this model produced second layer cells with large receptive fields and spatial invariance much like the complex cells in early visual cortex.
These sparse coding models have shown how responses of cells in one or two layer cortical networks can develop via learning in the appropriate generative models, but have been unable to explain how coding develops in multiple layers of cortical hierarchies. Recent progress in this area has been made using Deep Belief Networks (DBNs) [90]. These are probabilistic generative models composed of multiple layers of stochastic binary units. The top two layers have undirected, symmetric connections between them and form an associative memory, and the lower layers receive top-down directed connections from the layer above. Inference proceeds using sampling (see Section 2.3.1), and the approach allows nonlinear distributed representations to be learnt a layer at a time [91].
DBNs are based on many years of development starting with the Boltzmann machine, a network of binary stochastic units comprising hidden and visible units. This employs a type of probabilistic model called an undirected graph, where connected nodes are mutually dependent [16] (these are not DAGs). This then led to a Restricted Boltzmann Machine (RBM) where there are no connections among hidden units. DBNs can then be formed by stacking RBMs, such that hidden layer unit activities in lower level RBMs become training data for higher level RBMs. Hinton [91] notes that the key to efficient learning in these hierarchical models is the use of undirected units in their construction.

Cortical Hierarchies
This section describes models of Bayesian inference in cortical hierarchies by Mumford [10], Rao and Ballard [7] and a more recent series of papers by Friston [1,11,12]. We very briefly review the basics of cortical anatomy, describe the modelling proposals, and then provide a concrete example.

Functional Anatomy.
The cortex is a thin sheet of neuronal cells which can be considered as comprising six layers, each differing in the relative density of different cell types. The relative densities of excitatory to inhibitory cells change from one cortical region to another, and these differences in "cytoarchitecture" can be used to differentiate, for example, region V1 from V2 [92,93]. Despite these differences there are many commonalities throughout cortex. For example, layer 4 comprises mainly excitatory granule cells, and so is known as the granular layer. Other layers are referred to as being agranular. The functional activity of a cylindrical column through the cortical sheet capturing several thousand neurons has been described in the form of a canonical microcircuit [94]. This circuit is proposed to be replicated across cortex, providing a modular architecture for neural computation.
It is now well established that cortical regions are arranged in hierarchies. Felleman and van Essen [92], for example, used anatomical properties to reveal the hierarchical structure of the macaque visual system. Anatomical connections from lower to higher regions originate from superficial layer 2/3 pyramidal cells and target the granular layer [92]. Anatomical connections from higher to lower areas originate from "deep" layer 5/6 pyramidal cells and target layers 1 and 6 (agranular layers). This connectivity footprint is depicted in Figure 8. This is a generic pattern of connectivity within cortex, although it is more clearly manifested in some brain areas than others [95]. Kennedy and Dehay [97] note that cortical hierarchies do not form a strict chain; for example, V1 can make a direct feedforward connection to V4 as well as indirectly through V2. They note that "hierarchical distance" can be defined in terms of laminar connectivity patterns. Long distance feedforward connections arise strictly from the supragranular layer (as in Felleman and van Essen), but shorter distance ones also have contributions from infragranular layers.
Figure 8: Anatomical connections from higher to lower areas originate from layer 5/6 pyramidal cells and target layer 1/6 cells in lower regions (shown in purple). Adapted from Shipp [95].
Functionally, one key concept concerning visual cortex, for example, is that there are separate "what" and "where" hierarchies, although this is being challenged by recent perspectives in active vision [98]. There is a good deal of evidence showing that representations at higher levels of these hierarchies are more enduring [99]. This makes sense as more abstract causes in our sensory world exist on a longer time scale; that is, objects may move, they may even change shape or colour, but they are still the same object.
If sensory input is at the bottom of the hierarchy then what is at the top? One idea is that rather than there being a top and a bottom there is an "inside" and an "outside" [1,96]. That is, there is a centre rather than a top. Brain regions around the outside receive information from different sensory modalities: vision, audition, touch. The next level inwards represents higher level modality specific information, such as lines and edges in the visual system or chirps and formants in the auditory system. As we progress closer to the centre, brain regions become multimodal as depicted in Figure 9.

Hierarchical Predictive Coding.
Mumford [10] has proposed how Bayesian inference in hierarchical models maps onto cortical anatomy. Specifically, he proposes that top-down predictions are sent from pyramidal cells in deep layers and received by agranular layers (purple arrows in Figure 8), and that prediction errors are sent from superficial pyramidal cells and are received by stellate cells in the granular layer (red arrows in Figure 8). Rao and Ballard [7] describe a predictive coding model of visual cortex in which "extraclassical" receptive field properties emerge due to predictions from higher levels. When the model is presented with images of an extended bar, for example, first layer cells processing input from near the end of the bar soon stop firing as the presence of signal at that location is accurately predicted by cells in the second layer which have larger receptive fields. This "end-stopping" effect in first layer cells is explained by there being no prediction error to send up to the second layer. By this later time, cells in the second layer already know about the bar.
In related work Rao and Ballard [100] consider a similar model, but where hidden layer representations are intrinsically dynamic. Inference in this model is then implemented with an Extended Kalman Filter (see Section 2.4). These dynamics embody a nonlinear prediction step which also helps to counteract the signal propagation delays introduced by the different hierarchical levels (see Section 3.2 for a discussion of this issue).
Lee and Mumford [31] review evidence from human brain imaging and primate neurophysiology in support of the hypothesis that processing in visual cortex corresponds to inference in hierarchical Bayesian models. They describe activity in visual areas as being tightly coupled with the rest of the visual system such that long latency V1 responses reflect increasingly more global feedback from abstract high level features. This is consistent with a nonlinear hierarchical and dynamical model and they propose that inference in this model could be implemented using particle filtering (see Section 2.4.1).
George and Hawkins [19] describe a "hierarchical temporal memory" model of activity in cortical hierarchies which makes spatio-temporal predictions. Inference in this model is based on the belief propagation algorithm and detailed proposals are made regarding the mapping of various computational steps onto activity in different cortical laminae.
A series of papers by Friston [1,11] reviews anatomical and functional evidence for hierarchical predictive coding, and describes implementations of Mumford's original proposal [10] with increasing levels of sophistication. These include the use of continuous-time nonlinear dynamical generative models and the use of a generalised coordinate representation of state variables. This concept from control theory represents a state variable together with its higher order derivatives, for example position, velocity, and acceleration, variables which have natural representations in the brain. Generalised coordinates effectively provide an extended time window for inference and may also provide a mechanism for postdiction (described in Section 3.2). This series of papers also describes a variational inference algorithm for estimating states (inference) and parameters (learning) and how these computations map onto cortical laminae. In later work [101] this framework was extended by expressing sensory input as a function of action, which effectively repositions an agent's sensory apparatus. The same variational inference procedures can then be used to select actions. This active inference framework is explained in recent reviews [12,28].

Two-Level Model.
We now describe a simple Bayesian model of object recognition which illustrates many of the previously described features. This is a simplified version of the models described by Rao and Ballard [7]. We focus on perception, that is, how the beliefs regarding the hidden variables in the network can be updated. For simplicity, we focus on a hierarchical model with just two levels, although the approach can be applied to models of arbitrary depth.
The identity of an object is encoded by the variable $x_2$, the features of objects by the variable $x_1$, and a visual image by $y$. The model embodies the notion that $x_2$ causes $x_1$, which in turn causes $y$. The probabilistic dependencies in the associated generative model can be written as
$$ p(y, x_1, x_2) = p(y \mid x_1)\, p(x_1 \mid x_2)\, p(x_2). $$
One can derive update rules for estimating the hidden variables by following the gradient of the above joint likelihood, or equivalently the log of the joint likelihood. This will produce MAP estimates of the hidden variables (see Section 2). Taking logs gives
$$ \log p(y, x_1, x_2) = \log p(y \mid x_1) + \log p(x_1 \mid x_2) + \log p(x_2). $$
We now make the additional assumption that these distributions are Gaussian. To endow the network with sufficient flexibility of representation, for example the ability to turn features on or off, we allow nonlinear transformations, $g(\cdot)$, between layers. That is,
$$ p(y \mid x_1) = \mathsf{N}\left(y; g_1(x_1), \lambda_1^{-1}\right), \quad p(x_1 \mid x_2) = \mathsf{N}\left(x_1; g_2(x_2), \lambda_2^{-1}\right), \quad p(x_2) = \mathsf{N}\left(x_2; 0, \lambda_3^{-1}\right), $$
where $g_1(x_1)$ and $g_2(x_2)$ are top-down predictions of lower level activity based on higher level representations, and $\lambda_i$ are precision parameters. This can also be written as
$$ y = g_1(x_1) + e_1, \quad x_1 = g_2(x_2) + e_2, \quad x_2 = e_3. $$
One can then derive the following update rules for the hidden variables
$$ \tau \dot{x}_1 = \lambda_1 g_1'(x_1)\, e_1 - \lambda_2 e_2, \quad \tau \dot{x}_2 = \lambda_2 g_2'(x_2)\, e_2 - \lambda_3 e_3, \qquad (34) $$
where $g'(\cdot)$ denotes the derivative of the nonlinearity and the prediction errors are given by
$$ e_1 = y - g_1(x_1), \quad e_2 = x_1 - g_2(x_2), \quad e_3 = x_2. $$
Figure 10 shows the propagation of predictions and prediction errors in this two-level network. The parameter $\tau$ in (34) determines the time scale of perceptual inference. The degree to which the activity of a unit changes as a function of its input is referred to as "gain." In (34) the input is the bottom-up prediction error. The gain is therefore dependent on the precision $\lambda_i$ and the slope of the nonlinearity $g'(\cdot)$. There are therefore at least two gain control mechanisms. These will change the balance between how much network dynamics are dependent on top-down versus bottom-up information. Similar equations can be derived for how to update the parameters of the model, as shown in [7].
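The perceptual updates for the two-level model can be simulated with a few lines of code. The sketch below assumes scalar variables, Gaussian densities with precisions lam1, lam2, and lam3, prediction errors e1 = y - g1(x1), e2 = x1 - g2(x2), and e3 = x2, and simple Euler integration with step 1/tau. These choices, and the function names, are our own illustration rather than the implementation in [7].

```python
def infer_two_level(y, g1, dg1, g2, dg2, lam=(1.0, 1.0, 1.0),
                    tau=10.0, n_steps=2000, x_init=(0.0, 0.0)):
    """Gradient ascent on the log joint of a scalar two-level Gaussian model.

    Updates (Euler steps of size 1/tau):
        x1 += (lam1 * dg1(x1) * e1 - lam2 * e2) / tau
        x2 += (lam2 * dg2(x2) * e2 - lam3 * e3) / tau
    """
    lam1, lam2, lam3 = lam
    x1, x2 = x_init
    for _ in range(n_steps):
        e1 = y - g1(x1)    # sensory prediction error
        e2 = x1 - g2(x2)   # error between feature level and object level
        e3 = x2            # deviation of the top level from its prior mean
        dx1 = lam1 * dg1(x1) * e1 - lam2 * e2
        dx2 = lam2 * dg2(x2) * e2 - lam3 * e3
        x1 += dx1 / tau
        x2 += dx2 / tau
    return x1, x2


def identity(x):
    """Linear 'nonlinearity', used so the MAP estimate has a closed form."""
    return x


def one(x):
    """Derivative of the identity function."""
    return 1.0


if __name__ == "__main__":
    # With unit precisions and y = 3 the MAP estimate is x1 = 2, x2 = 1.
    print(infer_two_level(3.0, identity, one, identity, one))
    # Raising the sensory precision pulls x1 towards the data.
    print(infer_two_level(3.0, identity, one, identity, one, lam=(10.0, 1.0, 1.0)))
```

With linear mappings the fixed point can be checked by hand, and raising the sensory precision shifts the estimate towards the bottom-up input, which is one of the two gain mechanisms noted in the text. Nonlinear mappings can be explored by passing, for example, a tanh function and its derivative.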

Gain Control.
The key element of a digital computer is a voltage-gated switch, the transistor, which is turned on and off by the same sorts of currents it controls. An understanding of neuronal gain control is important to computational neuroscience [102]. Simple sensory reflexes, for example, can be turned off and replaced by responses based on higher level cognitive processing. There are a number of potential mechanisms in the brain for gain control including synchronization, neuromodulation, recurrent dynamics, and inhibition.
Equation (34) shows that the gain of a unit is dependent on the slope of the nonlinearity $g'(\cdot)$. If we interpret a unit as reflecting the activity of a population of cells then this slope can be increased, for example, by increasing the synchronization among cells. Highly synchronized cell populations have large gain [103]. In addition, this gain can be amplified by recurrent computation in neural networks [102,104].

Figure 10: Predictive coding architecture for inference in hierarchical models. Each level in the hierarchy is located in a different brain region. Each region has a population of error units and a population of causal units. The error units are hypothesised to reside in superficial cortical laminae and causal units in deep laminae. Error units receive messages from the state units in the same level and the level above, whereas state units are driven by error units in the same level and the level below. The person near the centre of the image would be difficult to see without a top-down prediction that there was somebody walking along the path. This prediction may be derived from previous time steps, hence the need for dynamic models, or from higher level scene knowledge that people walk on paths.

Neuromodulation.
Equation (34) also shows that gain can be changed by manipulating the precision $\lambda_i$. It has been proposed that neuromodulators can change $\lambda_i$ and so modulate the encoding of uncertainty. Neuromodulators are generated in subcortical nuclei and distributed to large regions of cortex. Different neuromodulators project to different cortical regions. For example, the highest concentrations of dopamine are found in striatum, basal ganglia, and frontal cortex. The detailed spatial specificity and temporal dynamics of neuromodulatory projections are unknown but they are thought to act as macroscopic signals [105].
Yu and Dayan [106] have considered the computational problem of assessing the validity of predictive cues in various contexts. Here a context reflects the set of stable statistical regularities that relate environmental entities such as objects and events to each other and to our sensorimotor systems. They propose that acetylcholine (ACh) signals the uncertainty that is expected within a given context and that norepinephrine (NE) signals the uncertainty associated with a change in context. Increasing levels of ACh and NE therefore downweight the strength of top-down (contextual) information and effectively upregulate bottom-up sensory input.
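To make the link between precision and this re-weighting concrete, consider a simplified reading of the two-level model of Section 5. As illustrative assumptions (they are not part of the proposal in [106]), take the lowest-level nonlinearity to be the identity, $g_1(x_1) = x_1$, and hold the higher level state $x_2$ fixed. If the update rule for $x_1$ has the form $\tau \dot{x}_1 = \lambda_1 g_1'(x_1) e_1 - \lambda_2 e_2$, with $e_1 = y - g_1(x_1)$ and $e_2 = x_1 - g_2(x_2)$, then setting $\dot{x}_1 = 0$ gives

$$\lambda_1 (y - x_1) - \lambda_2 \left(x_1 - g_2(x_2)\right) = 0 \quad \Longrightarrow \quad x_1 = \frac{\lambda_1\, y + \lambda_2\, g_2(x_2)}{\lambda_1 + \lambda_2}.$$

The estimate is a precision-weighted average of the sensory input $y$ and the top-down prediction $g_2(x_2)$. Under this reading, a neuromodulatory signal that lowers $\lambda_2$ relative to $\lambda_1$ moves the estimate towards the sensory input, which is one way in which top-down information could be downweighted.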
It has also been proposed that dopamine signals uncertainty in reward delivery [107]. This proposal has been elaborated upon by Friston et al. [108] who propose that dopamine balances the relative weighting of top-down beliefs and bottom-up sensory information when making inferences about cues that reliably signal potential actions. A dynamical model of cued sequential movements was developed in which inference proceeded using the variational approach described earlier, and the resulting simulated behaviours were examined as a function of synthetic dopamine lesions.

Recurrent Dynamics.
Abbott [102] suggests that small differences in gain from, for example, synchronization can be amplified via dynamics in recurrent networks. Yu and Dayan [109] have used such dynamics in a model of visual attention. They developed a generative model of the Posner attentional task where a central cue predicts the location of a stimulus which then has a property (orientation) about which subjects have to make a decision, for example, press the left button if the stimulus points left. Here there are two feature dimensions: spatial location and orientation. Inference in the Yu and Dayan model then shows how priors in one feature dimension (spatial) can gate inference in the other (orientation). This is consistent with electrophysiological responses whereby spatial attention has a multiplicative effect on orientation tuning of visual cortical neurons.
In the Yu et al. [45] study of the Eriksen Flanker task, referred to in Section 3.2, an approximate inference algorithm was proposed. This made the default assumption that the stimuli would be congruent, so that processing could proceed using a feedforward network in which "congruent" connections were facilitated using gain control. Upon detection of response conflict, however, an "incongruent" set of feedforward connections would instead be facilitated.

Receptor Pharmacology.
Long range connections in the brain, both bottom-up and top-down, are excitatory and use the neurotransmitter glutamate. Glutamate acts on two types of postsynaptic receptor: (i) AMPA receptors and (ii) NMDA receptors. NMDA receptors have a different action depending on the current level of postsynaptic potential, that is, they are voltage-gated. Top-down connections are known to have a greater proportion of NMDA receptors, which provides a mechanism for top-down signals to gate bottom-up ones.
Corlett et al. [110] review the action of various drugs on psychotic effects and describe their action in terms of their receptor dynamics and inference in hierarchical Bayesian networks. Ketamine, for example, upregulates AMPA and blocks NMDA transmission. This will increase bottom-up signalling, which is AMPA-mediated, and reduce top-down signalling, which is NMDA-mediated. They suggest this will in turn lead to delusions, that is, inappropriate inference of high level causes. Bayesian models of psychosis and underlying links to the pharmacology of synaptic signalling are discussed at length in [8]. See also [111] for a broader view of computational modelling for psychiatry.

Planning and Control
This review has briefly considered optimal decision making in terms of the likelihood ratio tests that may be reported by single neurons [15]. But as yet, we have had nothing to say about sequential decisions, planning, or control. Here, the key difference is that our decisions become actions which affect the state of the world, which will in turn affect what the next optimal action would be. Because the number of possible action sequences grows exponentially with time, this is a difficult computational problem. It is usually addressed using various formalisms, from optimal control theory [112,113] to reinforcement learning [114]. For reviews of these approaches applied to neuroscience see [115,116].
Here we focus on recent theoretical developments in this area where research has shown how problems in optimal control theory, or "model-based" reinforcement learning, can be addressed using a purely Bayesian inference approach. For example, Attias [117] has proposed that planning problems can be solved using Bayesian inference. The central idea is to infer the control signals, $u_n$, conditioned on known initial state $x_1$ and desired goal states $x_n$. Toussaint [118], for instance, describes the estimation of control signals using a Bayesian message passing algorithm which reduces to a classic control theoretic formulation for linear Gaussian dynamics. This framework can also be extended to accommodate desired observations, $Y_N$. The appropriate control signals can then be computed by estimating the density $p(u_n \mid x_1, Y_N)$, which can be implemented using backwards inference (see Section 6). This approach is currently being applied to systems level modelling of spatial cognition [119].
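The idea of inferring controls from a known initial state and a desired goal state can be sketched for a small discrete problem. The example below is an illustrative toy, not the algorithm of [117] or [118]: it assumes a five-state line world, three actions (left, stay, right), a hypothetical success probability for each move, a uniform prior over actions, and a goal that must be occupied at the final time step. A backward message is propagated from the goal and combined with the first transition to give the posterior over the first action.

```python
import numpy as np

N_STATES = 5            # hypothetical line world with states 0..4
ACTIONS = (-1, 0, +1)   # left, stay, right
P_SUCCESS = 0.8         # assumed probability that a move succeeds


def transition_matrix(action):
    """T[x, x'] = p(x' | x, action); failed moves leave the state unchanged."""
    T = np.zeros((N_STATES, N_STATES))
    for x in range(N_STATES):
        target = min(max(x + action, 0), N_STATES - 1)
        T[x, target] += P_SUCCESS
        T[x, x] += 1.0 - P_SUCCESS
    return T


def first_action_posterior(start, goal, horizon):
    """Posterior p(u_1 | x_1 = start, x_N = goal) with N = horizon time points."""
    T = np.stack([transition_matrix(a) for a in ACTIONS])  # shape (A, S, S)
    prior_u = np.full(len(ACTIONS), 1.0 / len(ACTIONS))    # uniform action prior
    T_marg = np.tensordot(prior_u, T, axes=1)              # actions marginalised

    # Backward message beta_n(x) = p(x_N = goal | x_n = x), started at n = N.
    beta = np.zeros(N_STATES)
    beta[goal] = 1.0
    for _ in range(horizon - 2):   # recurse from beta_N down to beta_2
        beta = T_marg @ beta

    # Combine the action prior, the first transition, and the backward message.
    unnorm = prior_u * (T[:, start, :] @ beta)
    return unnorm / unnorm.sum()


if __name__ == "__main__":
    posterior = first_action_posterior(start=0, goal=4, horizon=7)
    for action, p in zip(("left", "stay", "right"), posterior):
        print(f"p(u_1 = {action:5s} | start, goal) = {p:.3f}")
```

Conditioning on the goal shifts probability mass towards the action that moves the agent in the goal's direction, even though the prior over actions is uniform.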
Similarly, Todorov has shown how control theoretic problems become linearly solvable if the cost of an action is quantified by penalising the difference between controlled and uncontrolled dynamics using Kullback-Leibler divergence [120]. Computation of optimal value functions is then equivalent to backwards inference in an equivalent dynamic Bayesian model [121] (see Section 2.4).
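A small numerical sketch may help to show what "linearly solvable" means here. It is written in the spirit of [120] but is not a reproduction of that work: it assumes a first-exit problem on a six-state chain, a lazy random walk as the uncontrolled (passive) dynamics, a hypothetical constant state cost, and a single terminal goal state. With the exponentiated negative value function z(x) = exp(-v(x)), the Bellman equation becomes the linear relation z(x) = exp(-q(x)) sum_x' p(x' | x) z(x') at non-terminal states, and the optimal controlled transitions are the passive ones re-weighted by z.

```python
import numpy as np

N_STATES = 6           # hypothetical chain with states 0..5
GOAL = N_STATES - 1    # single terminal (goal) state
STATE_COST = 0.5       # assumed cost per step in non-terminal states


def passive_dynamics():
    """Uncontrolled dynamics: step left, stay, or step right with equal probability."""
    P = np.zeros((N_STATES, N_STATES))
    for x in range(N_STATES):
        for step in (-1, 0, 1):
            nxt = min(max(x + step, 0), N_STATES - 1)
            P[x, nxt] += 1.0 / 3.0
    return P


def solve_desirability(P, q, goal):
    """Solve z = exp(-q) * (P z) at interior states, with z(goal) = exp(-q(goal))."""
    interior = [x for x in range(len(q)) if x != goal]
    G = np.diag(np.exp(-q[interior]))
    z = np.empty(len(q))
    z[goal] = np.exp(-q[goal])
    A = np.eye(len(interior)) - G @ P[np.ix_(interior, interior)]
    b = G @ P[interior, goal] * z[goal]
    z[interior] = np.linalg.solve(A, b)   # one linear solve, no iteration over policies
    return z


def optimal_transitions(P, z):
    """Controlled dynamics u*(x' | x) proportional to p(x' | x) z(x')."""
    U = P * z[np.newaxis, :]
    return U / U.sum(axis=1, keepdims=True)


P = passive_dynamics()
q = np.full(N_STATES, STATE_COST)
q[GOAL] = 0.0
z = solve_desirability(P, q, GOAL)
v = -np.log(z)                 # optimal cost-to-go
U = optimal_transitions(P, z)

if __name__ == "__main__":
    print("cost-to-go v:", np.round(v, 3))
    print("p(step right) passive vs controlled at state 2:", P[2, 3], round(U[2, 3], 3))
```

The backward structure is visible in the linear system: desirability propagates from the goal through the passive dynamics, in the same way that a backward message propagates evidence from a future observation.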
We refer to the above approaches using the term Planning as Inference. Planning as Inference requires the propagation of uncertainty forwards and backwards in time. This can be implemented using the forwards and backwards inference procedures described earlier. For these algorithms to be implemented in the brain we must have an online algorithm such as the gamma recursions. An advantage of considering control and planning problems as part of the same overall Bayesian inference procedure is that it becomes very natural to model the tight coupling that systems neuroscientists believe underlies action and perception [98,122].

Discussion
This paper has hopefully shown that Bayesian inference provides a general theoretical framework that explains aspects of both brain activity and human behaviour. Bayesian inference can quantitatively account for results in experimental psychology on sensory integration, visual processing, sensorimotor integration, and collective decision making. It also explains the nonlinear dynamical properties of synapses, dendrites, and sensory receptive fields where neurons and neural networks are active predictors rather than passive filters of their sensory inputs.
More generally, the field is beginning to relate constructs in Bayesian inference to the underlying computational infrastructure of the brain. At the level of systems neuroscience, brain imaging technologies are likely to play a key role. For example, the signals measured with neuroimaging modalities such as Electroencephalography (EEG) and Magnetoencephalography (MEG) are thought to derive mainly from superficial pyramidal cells. Cortical signals measured with these modalities should therefore correspond to prediction error signals in the hierarchical predictive coding models described in Section 5. Transcranial Magnetic Stimulation (TMS) can be used to knock out activity in various brain regions and therefore infer which are necessary for perceptual inference [31]. Functional Magnetic Resonance Imaging (fMRI) can be used to monitor activity in lower level regions that may be explained away by activity in higher level regions [31]. An important recent development is the use of dynamic models of brain connectivity to estimate strengths of connections between regions [123]. This allows for the quantitative assessment of changes in top-down or bottom-up signalling from brain imaging data [124,125].
A particularly exciting recent theoretical development is the notion of Planning as Inference described in Section 6. Previously, Bayesian inference has been used to explain perception and learning. This recent research suggests how Bayesian inference may also be used to understand action and control. This closes the loop and reflects the tight coupling that systems neuroscientists believe underlies action and perception in the human brain [98,122]. Central to this endeavour are the forwards and backwards recursions in time that are necessary to compute optimal value functions or control signals. Our review has also suggested, in Section 3.2, that they may also be necessary to model perceptual inference at a much shorter time scale.

Figure 1: This paper reviews work which relates constructs in Bayesian inference to those in experimental psychology and neuroscience.

Figure 2: Estimating the position of the ball when it first lands. The prior is shown in blue, the likelihood distribution in red, and the posterior distribution with the white ellipse. The maximum posterior estimate is shown by the magenta ball. This estimate can be updated in light of new information about the ball's trajectory (yellow). Adapted from Wolpert and Ghahramani [13].

Figure 3: Bayes rule for Gaussians. For the prior $p(x)$ (blue) $m_0 = 20$, $\lambda_0 = 1$ and the likelihood $p(y \mid x)$ (red) $m_D = 25$ and $\lambda_D = 3$, the posterior $p(x \mid y)$ (magenta) shows the posterior distribution with $m = 23.75$ and $\lambda = 4$. The posterior is closer to the likelihood than the prior because the likelihood has higher precision. Bayes rule for Gaussians has been used to explain many behaviours from sensory integration to collective decision making.

Figure 5: Perception as forwards inference over states. In this and subsequent figures, the gray shading indicates a known variable. Perception here corresponds to estimation of hidden state density $p(x_n \mid U_n, Y_n)$ given known motor efference copy $U_n$ and sensory input $Y_n$. Here and in later figures, the red arrows indicate temporal dependencies, and $U_n$ and $Y_n$ indicate sequences up to time $n$ (see main text). These dynamical models have been used to explain sensorimotor integration and sensorimotor learning.

Figure 7: Planning as forward and backwards inference over states and controls. Planning can be formulated as estimation of a density over actions $p(U_N \mid x_1, Y_N)$ given current state $x_1$ and desired sensory consequences, $Y_N$.

Figure 8: Anatomical connections from lower to higher regions in a serial cortical hierarchy originate from superficial layer 2/3 pyramidal cells in an ascending pathway (shown in red). Anatomical connections from higher to lower areas originate from layer 5/6 pyramidal cells and target layer 1/6 cells in lower regions (shown in purple). Adapted from Shipp [95].

Figure 9: Cortical architecture depicting multimodal areas in the centre and unimodal sensory processing regions on the periphery, with visual regions shown at the bottom and auditory regions on the right.Adapted from Mesulam [96].