Bayes Rule to Free Energy

Getting from Bayes' Rule to Free Energy

Joint Probabilities

Bayes' rule suggests that even if we cannot directly observe a variable of interest, perhaps because it is difficult or inconvenient to measure, there is still an optimal way of updating our beliefs about it given some data that acts as evidence. For one variable to tell us about another, joint probabilities are important. When we can identify pairs of variables - e.g. Seasons (Z) and Weather (S) - as occurring together in the observable outcomes generated by some process, we can talk about the joint probability that any pair of events, one from each variable, co-occur - e.g. snow and winter, p(sn, w); the order of the terms doesn't matter. The joint probability distribution then gives us the entire list of joint probabilities for all of the possible event pairs drawn from the two variables, necessarily summing to 1. In the example there are a total of 8 possible event pairs from combining 2 weathers (S) × 4 seasons (Z).

| Seasons (Z) \ Weather (S) | Snow: p(sn) = 0.22  | No-Snow: p(ns) = 0.78 | Row sum              |
|---------------------------|---------------------|-----------------------|----------------------|
| Spring: p(sp) = 0.25      | p(sn, sp) = 0.02    | p(ns, sp) = 0.23      | sum p(S, sp) = 0.25  |
| Summer: p(su) = 0.25      | p(sn, su) = 0.01    | p(ns, su) = 0.24      | sum p(S, su) = 0.25  |
| Autumn: p(a) = 0.25       | p(sn, a) = 0.03     | p(ns, a) = 0.22       | sum p(S, a) = 0.25   |
| Winter: p(w) = 0.25       | p(sn, w) = 0.16     | p(ns, w) = 0.09       | sum p(S, w) = 0.25   |
| Column sum                | sum p(sn, Z) = 0.22 | sum p(ns, Z) = 0.78   | 1.00                 |

Sum of any probability distribution: p(S), p(Z) or p(S, Z) = 1.00

The probability of a single event from one of the variables - e.g. winter, p(w) - can be derived from the joint probability distribution by summing the joint probabilities of all the event pairs which contain that event. This is necessarily a sum over each of the different events of the opposing variable - e.g. 1 winter (w) × 2 weathers (S):

$$p(w) = \sum_{s \in S} p(s, w) = p(sn, w) + p(ns, w) = 0.16 + 0.09 = 0.25$$

This is known as a marginal probability, and all the marginal probabilities from one variable sum to 1 in a marginal probability distribution.
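As a minimal sketch of the marginalisation described above, the snippet below stores the joint probabilities from the weather table in a dictionary and sums them over the opposing variable. The dictionary layout and the helper name `marginal` are illustrative choices, not something the original text prescribes.

```python
# Joint probabilities p(S, Z) copied from the seasons/weather table.
joint = {
    ("snow", "spring"): 0.02, ("no_snow", "spring"): 0.23,
    ("snow", "summer"): 0.01, ("no_snow", "summer"): 0.24,
    ("snow", "autumn"): 0.03, ("no_snow", "autumn"): 0.22,
    ("snow", "winter"): 0.16, ("no_snow", "winter"): 0.09,
}


def marginal(joint, axis):
    """Sum the joint over the opposing variable.

    axis 0 gives the weather marginals p(S); axis 1 gives the season marginals p(Z).
    (Hypothetical helper, for illustration only.)
    """
    totals = {}
    for pair, p in joint.items():
        totals[pair[axis]] = totals.get(pair[axis], 0.0) + p
    return totals


p_weather = marginal(joint, 0)
p_season = marginal(joint, 1)

# Rounded for display; floating-point sums may differ in the last digits.
print({event: round(p, 2) for event, p in p_weather.items()})
print({event: round(p, 2) for event, p in p_season.items()})
```

Each marginal distribution should sum to 1, just as the full joint distribution does.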
One way we might think about the example is that if the marginal probability of winter reflects how many days of the year are in winter - i.e. 91/365 days (3 months) - then the constituent joint probabilities reflect the partition of winter into days that did or did not have snow - i.e. 58/365 days and 33/365 days respectively. The proportion of an event's marginal probability that each of these summed joint probabilities accounts for then gives us a conditional probability distribution for the opposing variable given the occurrence of that event - e.g. p(S | w):

| Seasons (Z) \ Weather (S) | Snow: p(sn) = 0.22 | No-Snow: p(ns) = 0.78 | Row sum               |
|---------------------------|--------------------|-----------------------|-----------------------|
| Spring: p(sp) = 0.25      | -                  | -                     | -                     |
| Summer: p(su) = 0.25      | -                  | -                     | -                     |
| Autumn: p(a) = 0.25       | -                  | -                     | -                     |
| Winter: p(w) = 0.25       | p(sn | w) = 0.64   | p(ns | w) = 0.36      | sum p(S | w) = 1.00   |

The table above shows the conditional probability distribution of the weather given winter, p(S | w). While the joint probabilities in the same row of the first table summed to p(w), the conditional probabilities instead give the proportion of p(w) that each of those joint probabilities accounts for, and so they sum to 1 as a probability distribution, with winter's occurrence fixed to the exclusion of the other seasons. Like the joint probabilities before, we might also think of these conditional probabilities as counting the days where winter coincided with either snow or non-snow weather; however, this is taken as a proportion of winter days rather than days in the year - i.e. 58/91 days and 33/91 days - and so it just counts snow and non-snow days given the occurrence of winter. We can also see it as describing how winter predicts the occurrence of snow or non-snow weather.

There is a different conditional probability distribution for a variable given each event from the opposing variable; in this example there are 4 distributions over the weather (one per season) and 2 distributions over the seasons (one per weather event), 6 in total.
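The division of a row (or column) of joint probabilities by its marginal can be sketched as follows. The function name `conditional` and its arguments are hypothetical; the numbers are the joint probabilities from the seasons/weather example.

```python
# Joint probabilities p(S, Z) from the seasons/weather example.
joint = {
    ("snow", "spring"): 0.02, ("no_snow", "spring"): 0.23,
    ("snow", "summer"): 0.01, ("no_snow", "summer"): 0.24,
    ("snow", "autumn"): 0.03, ("no_snow", "autumn"): 0.22,
    ("snow", "winter"): 0.16, ("no_snow", "winter"): 0.09,
}


def conditional(joint, given_axis, given_event):
    """Distribution of the opposing variable given one fixed event.

    given_axis is 0 when the fixed event is a weather, 1 when it is a season.
    (Hypothetical helper, for illustration only.)
    """
    other_axis = 1 - given_axis
    row = {pair[other_axis]: p
           for pair, p in joint.items() if pair[given_axis] == given_event}
    total = sum(row.values())  # marginal probability of the fixed event
    return {event: p / total for event, p in row.items()}


weather_given_winter = conditional(joint, 1, "winter")  # p(S | w)
season_given_snow = conditional(joint, 0, "snow")       # p(Z | sn)

print({e: round(p, 2) for e, p in weather_given_winter.items()})
print({e: round(p, 3) for e, p in season_given_snow.items()})
```

Calling `conditional` once per season and once per weather event yields the 6 conditional distributions available in this example.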
The following table gives another example, this time a conditional distribution of the seasons given snow, p(Z | sn):

| Seasons (Z) \ Weather (S) | Snow: p(sn) = 0.22    | No-Snow: p(ns) = 0.78 |
|---------------------------|-----------------------|-----------------------|
| Spring: p(sp) = 0.25      | p(sp | sn) = 0.09     | -                     |
| Summer: p(su) = 0.25      | p(su | sn) = 0.045    | -                     |
| Autumn: p(a) = 0.25       | p(a | sn) = 0.135     | -                     |
| Winter: p(w) = 0.25       | p(w | sn) = 0.73      | -                     |
| Column sum                | sum p(Z | sn) = 1.00  | -                     |

We can see from both tables above, with the example of snow and winter, that we can derive conditional probabilities in the same way for either of the co-occurring events in a joint probability: e.g.

$$p(sn \mid w) = \frac{p(sn, w)}{p(w)}, \qquad p(w \mid sn) = \frac{p(sn, w)}{p(sn)}$$

This entails, by rearrangement, what is known as the product rule, which will be useful later: e.g.

$$p(sn, w) = p(sn \mid w)\, p(w) = p(w \mid sn)\, p(sn)$$

When we re-express an event's marginal probability under the product rule, we can also interpret it as the average of that event's conditional probabilities given the opposing variable: e.g.

$$p(w) = \sum_{s \in S} p(s, w) = \sum_{s \in S} p(w \mid s)\, p(s) = p(w \mid sn)\, p(sn) + p(w \mid ns)\, p(ns)$$

Given that the average or mean of some values is found by summing them together and dividing by the total number of values, we can see its connection to the marginal probability as follows:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} \frac{1}{N}\, x_i$$

1/N acts as the weight or proportion that each value contributes to the average. For a normal average, these weights are the same for each value, but in some situations we may want to vary the contribution of each value, provided the weights all sum to 1. When the weight is looked at as a probability, we can see how the final form of the average is exactly analogous to the form of the marginal probability:

$$\bar{x} = \sum_{i} p(x_i)\, x_i \qquad \text{compared with} \qquad p(w) = \sum_{s \in S} p(s)\, p(w \mid s)$$

In terms of random variables, this weighted average is called the expected value:

$$p(w) = \mathbb{E}_{p(S)}[\, p(w \mid S) \,]$$

Conditional and marginal probabilities therefore give an event's probability of occurrence when the opposing variable respectively has or hasn't been specified. Conditional probabilities describe how one event is predicted by the occurrence of another, while marginal probabilities average out any influence from the opposing variable.
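To make the "marginal as a weighted average" reading concrete, here is a small sketch using the joint probabilities from the example. It recovers p(w) by weighting p(w | s) with p(s), and contrasts that with a plain unweighted mean of the two conditional probabilities. The variable names are illustrative.

```python
# Joint probabilities involving winter, and the weather marginals, from the example.
p_joint_winter = {"snow": 0.16, "no_snow": 0.09}   # p(s, w)
p_weather = {"snow": 0.22, "no_snow": 0.78}        # p(s)

# Conditional probability of winter given each weather: p(w | s) = p(s, w) / p(s).
p_winter_given = {s: p_joint_winter[s] / p_weather[s] for s in p_weather}

# Weighted average (expected value) with the weather probabilities as weights.
weighted = sum(p_weather[s] * p_winter_given[s] for s in p_weather)

# Plain average, where every value gets the same weight 1/N.
unweighted = sum(p_winter_given.values()) / len(p_winter_given)

print("p(w | s):", {s: round(p, 3) for s, p in p_winter_given.items()})
print("weighted average  :", round(weighted, 3))
print("unweighted average:", round(unweighted, 3))
```

Only the weighted version reproduces the marginal p(w) = 0.25, because snow days are much rarer than non-snow days and so should contribute less to the average.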
Bayes' Rule

Bayes' rule just applies these ideas to beliefs, representing the uncertainty of our beliefs with probabilities that we allocate to competing hypotheses. We can then look at our beliefs before we have observed any data or evidence as a marginal probability distribution, and our beliefs after some data has been observed as the corresponding conditional probability distribution given that data. These are known as the prior and posterior probability distributions respectively. Using the rules of probability, we then have a mathematically consistent method of prescribing how to update the uncertainty of our beliefs with new data. As the aim is to find the probabilities for a set of competing hypotheses (H) given some data (d), Bayes' rule basically amounts to the conditional probability expression seen earlier:

$$p(h \mid d) = \frac{p(h, d)}{p(d)}$$

Using the product rule we can further arrange Bayes' rule as:

$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{p(d)}$$

This expression of Bayes' rule is especially useful when we consider how inference typically works, where well-defined types of probability distribution are fit to some data. For instance, the best-known type of probability distribution is the normal distribution, shaped like a bell curve. The different possible values of the parameters which summarise the properties of these probability distributions act as hypotheses about the possible values of some variable of interest which generates data. In a normal distribution, these would be the mean, which locates the peak of the bell along a scale, and the variance (or its inverse, the precision), which measures the bell's width. The prior probability distribution of the parameters is chosen by us and would ideally reflect the uncertainty of our initial beliefs about the parameters, though there are often various different prior distributions that could be chosen.
Given specific values of the parameters, a well-defined type of probability distribution will then also prescribe probabilities for what values the data might take - the conditional probability of the data given the parameter, p(d | h), often known as the likelihood. Multiplying a single hypothesis' prior probability p(h) by the conditional probability p(d | h), which denotes how well the data is predicted by that hypothesis, produces a joint probability for that hypothesis-data pair. We can find the posterior probability for any hypothesis from the ratio of its hypothesis-data pair's joint probability to the total sum over all the competing hypothesis-data pairs. This latter sum is just the marginal probability of the data p(d); however, since we only have access to the likelihood and the prior, this quantity need not reflect the actual probability of the data as it did in the weather example. Instead, we should look at it as the degree to which the data is predicted by the model as a whole, averaging across the different possible hypotheses or parameter values of the prior probability distribution. We might then also call it the model evidence.

It is also worth noting that the data we analyse isn't necessarily a single point from a probability distribution but can be a whole set of separately observed data points analysed together, each with its own probability of occurring depending on its value. Under the assumption that the points are statistically independent, the probability of the whole data set can be found by multiplying the point probabilities together.

Bayes to Free Energy

The main premise of free energy minimisation is that performing Bayes' rule for many realistic problems is too difficult, because you end up having to sum over large numbers of terms to find the marginal probability of the data that defines the denominator in Bayes' rule.
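The prior-times-likelihood recipe, including the sum that forms the denominator, can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the four candidate means, the uniform prior, the fixed standard deviation of 1, and the three made-up data points.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Hypotheses: candidate values for the mean (illustrative).
hypotheses = [0.0, 1.0, 2.0, 3.0]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform prior p(h)
data = [1.8, 2.2, 2.5]                                   # made-up observations
sd = 1.0                                                 # assumed known


def likelihood(data, h):
    """p(d | h): product of point probabilities, assuming independent points."""
    return math.prod(normal_pdf(x, h, sd) for x in data)


# Joint probability of each hypothesis-data pair: p(h, d) = p(d | h) p(h).
joint = {h: likelihood(data, h) * prior[h] for h in hypotheses}

# Model evidence p(d): the sum over all competing hypotheses.
evidence = sum(joint.values())

# Posterior p(h | d): each joint probability as a fraction of the evidence.
posterior = {h: joint[h] / evidence for h in hypotheses}

for h in hypotheses:
    print(f"h = {h}: prior {prior[h]:.2f} -> posterior {posterior[h]:.3f}")
```

With only four hypotheses the sum for `evidence` is trivial. The difficulty described above arises when the hypotheses range over many parameters at once, so that the number of terms in that sum grows very quickly.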
This is especially so when dealing with a large range of possible parameter values, multiple parameters and multiple variables. In situations which call for a type of model that is too computationally intensive to use, we might instead try to fit a type of model which makes simplifying assumptions about the parameters, variables and structure, and which makes it easier to find a good distribution of parameter values or hypotheses that fits some data set. Even if technically wrong, this may be useful if we find a distribution good enough at describing the data set. To do this, we must find a way of comparing the posterior probability distribution of the ideally preferred but difficult type of model with the possible probability distributions available from a simpler type of model that is feasible to use. The catch is that we have to do it without knowing the posterior probability distribution of the ideal model.

A way to summarise probability distributions is with a quantity called (informational) entropy, seen both as a measure of uncertainty and as the average information conveyed by the probability distribution of events from some variable. We might think of uncertainty as a function of the number of possible states a variable can reside in, where uncertainty increases with the number of possible states n: e.g.

$$H = \log n$$

which can be generalised to take into account states occurring with different probabilities:

$$H(X) = -\sum_{x} p(x) \log p(x)$$

For a variable with n possible states, this is always maximised when all of the states are equiprobable. Conversely, the lower the entropy, the fewer states a system effectively occupies and the more biased it is towards occupying some states more than others.

The logarithm (log) of some number (e.g. 0.125) can be defined as the power to which we would raise a base (e.g. 2) in order to get that number. We can see this in the expression:

$$2^{-3} = 0.125$$

Here, 2 is the base and -3 is the power to which we raise 2 in order to get 0.125.
We can then identify that power -3 as the logarithm of 0.125 when using 2 as a base:

$$\log_2 0.125 = -3$$

The choice of base isn't important so long as it is used consistently, and so specifying the base is largely unnecessary here; but when the base is 2, the unit of measurement is referred to as a bit, which I will generally use.

The entropy has the same form as the weighted average shown earlier:

$$H(X) = \sum_{x} p(x)\, [-\log p(x)] = \mathbb{E}_{p(x)}[-\log p(x)]$$

And so we can see it as an average of the negative log probabilities -log p(x) of each of the variable's possible states. This is the self-information, which can be seen as quantifying how much information a state conveys when we observe it; it is the logarithm of the inverse of the state's probability, so the less probable a state, the more information it conveys. We can think of information in terms of the reduction of uncertainty, or the ruling out of alternative possibilities, when we observe and identify some state.

While we can talk about the amount of information or number of bits communicated to us when we identify the states of a system, what if we wanted to preserve this information and communicate it to someone else? We can then talk about information as a cost - the number of bits required to convey some kind of message. We might want to write what was observed into a coded message that could be read by someone else, where codewords of various lengths uniquely map to and identify whatever states were observed. It can be proved that, when viewing the base used in our measure of information as specifying the number of symbols available in a code alphabet (e.g. log base 2 entails 2 possible symbols, such as 1 and 0), we can equate the amount of information communicated by a system on average (its entropy) with the minimum average code length necessary to preserve the system's information in code, as stated in Shannon's fundamental theorem for a noiseless channel.
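The entropy as an average of self-information can be checked numerically with a short sketch. The function names are illustrative, base 2 is used so that results are in bits, and the example distributions are a uniform one over 8 states and the weather marginals from earlier.

```python
import math


def self_information(p):
    """Information in bits conveyed by observing a state of probability p."""
    return -math.log2(p)


def entropy(dist):
    """Average self-information in bits; zero-probability states contribute nothing."""
    return sum(p * self_information(p) for p in dist if p > 0)


uniform_8 = [0.125] * 8     # 8 equiprobable states
weather = [0.22, 0.78]      # p(sn), p(ns) from the weather example
fair_coin = [0.5, 0.5]

print("self-information of p = 0.125:", self_information(0.125), "bits")
print("entropy, 8 equiprobable states:", entropy(uniform_8), "bits")
print("entropy, fair coin:", entropy(fair_coin), "bits")
print("entropy, weather:", round(entropy(weather), 3), "bits")
```

The biased weather distribution comes out below the 1 bit of a fair two-state variable, in line with the claim that entropy is maximised when states are equiprobable.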
An interesting observation hinted at by the form of the entropy is that in an optimal code, the cost of encoding an event should be the negative logarithm of its probability, so that rarer events get longer codewords. This suggests that optimal (minimum possible) code lengths are specific to the probability distributions they encode: e.g.

$$\text{code length}(h) = -\log p(h)$$

since in the entropy each -log p(h) is weighted by the p(h) of the same event. This raises the question of what happens when we encode probability distributions with code lengths not tailored to those distributions. This is specified in the expression called the cross-entropy:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

Here the self-information part specifying the code length, -log q(x), is from a different probability distribution to the probability it is multiplied by, p(x). A simplified way to conceptualise this is a scenario where some information source generates events at probabilities given by p(x), which we must encode in a message using codewords that have lengths optimised for another distribution, -log q(x). We could then ask for the minimum number of code symbols it would cost on average to preserve information about the generated events when using codewords for the wrong distribution. It can be proved that the cross-entropy (using the wrong distribution) is always greater than or equal to, in bits of information, the entropy obtained when the correct code lengths are used to code that distribution - this is known as Gibbs' inequality (which can be proven with another inequality called Jensen's inequality):

$$-\sum_{x} p(x) \log q(x) \;\geq\; -\sum_{x} p(x) \log p(x)$$

Therefore, representing some events using the wrong codewords or descriptions must, on average, require a greater cost in bits of information and code length to communicate those events to someone. Only when the distributions p(x) and q(x) are identical is the extra cost removed.
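A quick numerical illustration of the extra cost of using the wrong code lengths, as a sketch. The "true" distribution p is the distribution of seasons given snow from earlier, and the "wrong" distribution q is a uniform distribution over the four seasons; this pairing is an arbitrary choice for illustration.

```python
import math


def entropy(p):
    """Average code length in bits when code lengths match p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Average code length in bits when events follow p but code lengths are -log2 q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.09, 0.045, 0.135, 0.73]   # p(Z | sn): spring, summer, autumn, winter
q = [0.25, 0.25, 0.25, 0.25]     # code lengths tuned to equiprobable seasons

h_p = entropy(p)
h_pq = cross_entropy(p, q)
extra_cost = h_pq - h_p          # never negative, by Gibbs' inequality

print("entropy H(p)         :", round(h_p, 3), "bits")
print("cross-entropy H(p, q):", round(h_pq, 3), "bits")
print("extra cost           :", round(extra_cost, 3), "bits")
```

Swapping in `q = p` makes the extra cost vanish, which matches the statement that only identical distributions remove it.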
The extra cost can be expressed by what is known as the Kullback-Leibler (KL) divergence, which is the difference between the cross-entropy and the entropy of the correct probability distribution:

$$\begin{aligned} H(p, q) - H(p) &= -\sum_{x} p(x) \log q(x) + \sum_{x} p(x) \log p(x) \\ &= -\sum_{x} p(x) \log \frac{q(x)}{p(x)} \\ &= \sum_{x} p(x) \log \frac{p(x)}{q(x)} \end{aligned}$$

The third row is due to the rule that log(a/b) = -log(b/a). It can then just be written as:

$$D_{KL}[\,p(x) \,\|\, q(x)\,] = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

The implication is that the greater this divergence between two probability distributions, the more dissimilar they are and the poorer approximations they would be for each other. Ideally we would like to use this divergence to compare candidate approximate probability distributions from the simplified type of model to the ideal model's posterior probability distribution. We can then measure their dissimilarity to the ideal posterior by the extra informational cost of using the simplified distributions. The candidate that minimises this cost the most is then the best one:

$$q^{*}(h) = \arg\min_{q} D_{KL}[\,p(h \mid d) \,\|\, q(h)\,]$$

However, computing a KL divergence between q(h) and p(h | d) obviously requires us to know the posterior in advance to make the comparison, which was the initial problem. In fact, we must remove any kind of summation over terms from the ideal model p to make this work, since complicated summing is what caused the initial problem of being unable to calculate the posterior. Summing terms from q is acceptable if the model is simple enough. Two things can be done to completely remove p(h | d) and any summing over terms from p:

1. Unintuitively, we can take the opposite KL divergence, in terms of the extra cost of using the ideal posterior distribution p to approximate the wrong distribution q. This is not the same as the more intuitive divergence, since the KL divergence is not symmetric, but as a measure of dissimilarity it may still be useful, and it removes the need to sum over any terms weighted by the ideal model p, where the problems began:

$$D_{KL}[\,q(h) \,\|\, p(h \mid d)\,] = \sum_{h} q(h) \log \frac{q(h)}{p(h \mid d)} = -\sum_{h} q(h) \log p(h \mid d) + \sum_{h} q(h) \log q(h)$$

2. However, p(h | d) is still in this equation, which we cannot use if it is too difficult to calculate in the first place.
An alternative, as in our initial use of Bayes' rule, is to make use of the quantities we have access to - the joint probability distribution in the form of the prior probabilities from the ideal model and the corresponding likelihoods for whatever data set is being looked at - to create a surrogate for the cross-entropy using the product rule:

$$\log p(h, d) = \log p(d \mid h) + \log p(h)$$

Note that when converting expressions to logarithms, multiplication becomes addition and division becomes subtraction, and so the product rule is expressed as such. The free energy can then be expressed as follows:

$$F = -\sum_{h} q(h) \log p(h, d) + \sum_{h} q(h) \log q(h) = \sum_{h} q(h) \log \frac{q(h)}{p(h, d)}$$

Distributions best at minimising this will best approximate the ideal model's posterior probability distribution, which itself minimises the free energy the most. We can see this in two further manifestations, obtained by using the product rule to explicitly dissociate the data part of the free energy from a KL divergence part between ideal and simplified distributions. Firstly:

$$F = \sum_{h} q(h) \log \frac{q(h)}{p(h \mid d)\, p(d)} = \sum_{h} q(h) \log \frac{q(h)}{p(h \mid d)} - \sum_{h} q(h) \log p(d)$$

Which can then be expressed finally as:

$$F = D_{KL}[\,q(h) \,\|\, p(h \mid d)\,] - \log p(d)$$

Here the expected value of -log p(d) just reduces to -log p(d) for a single data set, because p(d) does not depend on h and the q(h) sum to 1. This expression shows us that free energy can be minimised by reducing the KL divergence between the approximate distribution and the posterior distribution of the difficult model, which was the initial aim when selecting an approximate distribution. Because the KL divergence is never negative, the free energy is also an upper bound on -log p(d), so minimising it has the side effect of giving us an approximation to p(d), the marginal probability of the data implied by the ideal model, which was found difficult to calculate initially.

The second manifestation:

$$F = \sum_{h} q(h) \log \frac{q(h)}{p(d \mid h)\, p(h)} = -\sum_{h} q(h) \log p(d \mid h) + \sum_{h} q(h) \log \frac{q(h)}{p(h)}$$

Which can then be expressed as:

$$F = -\mathbb{E}_{q(h)}[\log p(d \mid h)] + D_{KL}[\,q(h) \,\|\, p(h)\,]$$

While there is only a single value for p(d), there is a different p(d | h) for each hypothesis, and so the expectation cannot be reduced.
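The two decompositions of the free energy can be verified numerically on a toy problem small enough that the exact posterior is available. The three hypotheses, their prior and likelihood values, and the candidate q are all made-up numbers for illustration; natural logarithms are used.

```python
import math

prior = [0.5, 0.3, 0.2]           # p(h), illustrative
likelihood = [0.10, 0.40, 0.70]   # p(d | h) for one fixed data set, illustrative
q = [0.2, 0.3, 0.5]               # an arbitrary candidate distribution q(h)


def kl(a, b):
    """KL divergence D_KL[a || b] for discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


joint = [l * p for l, p in zip(likelihood, prior)]   # p(h, d) = p(d | h) p(h)
evidence = sum(joint)                                 # p(d), easy here, hard in general
posterior = [j / evidence for j in joint]             # p(h | d)

# Free energy defined only from q and the joint probabilities.
free_energy = sum(qh * math.log(qh / jh) for qh, jh in zip(q, joint))

# First form: divergence from the posterior minus the log evidence.
form_1 = kl(q, posterior) - math.log(evidence)

# Second form: expected negative log likelihood plus divergence from the prior.
form_2 = -sum(qh * math.log(lh) for qh, lh in zip(q, likelihood)) + kl(q, prior)

# Free energy when q is exactly the posterior.
free_energy_at_posterior = sum(ph * math.log(ph / jh) for ph, jh in zip(posterior, joint))

print("F:", round(free_energy, 6))
print("form 1:", round(form_1, 6), " form 2:", round(form_2, 6))
print("-log p(d):", round(-math.log(evidence), 6))
print("F at posterior:", round(free_energy_at_posterior, 6))
```

All three expressions for F agree, F stays above -log p(d) for this q, and F equals -log p(d) when q is the exact posterior.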
This expression shows us that minimising the KL divergence between the ideal posterior and an approximate distribution is equivalent to finding a distribution of hypotheses that predicts well and maximises the probability of the data set given those hypotheses (shown by the weighted average of the likelihood on the left hand side). It must also minimise the KL divergence between the approximate distribution and the prior probability distribution from the ideal model (the divergence term on the right hand side). This latter term reflects how much of the model's prior distribution turns out to be redundant when the model is fit to data, often because the prior accommodates a large number of hypotheses or possible parameter values. This redundancy allows the model as a whole to predict a broader range of data (as model evidence) at the expense of predicting any particular data set less sharply. Conversely, an approximate distribution that narrows too far onto the hypotheses favoured by one particular data set risks capturing its accidental features rather than its regularities, known as overfitting. Minimising this KL term encourages an approximate distribution which does not stray too far from the ideal model's prior distribution by being too narrow, keeping its options open about the causes of the data in order to resist overfitting.

The idea of free energy minimisation, then, is that instead of calculating the ideal model's posterior probability directly through Bayes' rule, we take a simpler type of model and pick the distribution of hypotheses or parameter values with the lowest free energy, on the grounds that this distribution is the candidate that most closely approximates the ideal posterior while being feasible to use.
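As a rough sketch of "pick the candidate with the lowest free energy", the snippet below scores a small family of bell-shaped candidate distributions over a grid of hypotheses and keeps the one with the lowest free energy. The grid, the uniform prior, the data, and the candidate family are all assumptions made for illustration. The exact posterior is computed at the end only to check the result, which would not be possible in the realistic problems the text has in mind.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


# Hypotheses: candidate means on a grid from -2.0 to 6.0 (illustrative).
hypotheses = [i * 0.5 for i in range(-4, 13)]
prior = [1.0 / len(hypotheses)] * len(hypotheses)
data = [1.8, 2.2, 2.5]  # made-up observations, standard deviation assumed to be 1
likelihood = [math.prod(normal_pdf(x, h, 1.0) for x in data) for h in hypotheses]


def free_energy(q):
    """F = sum_h q(h) log(q(h) / (p(d | h) p(h))); terms with q(h) = 0 contribute 0."""
    return sum(qh * math.log(qh / (lh * ph))
               for qh, lh, ph in zip(q, likelihood, prior) if qh > 0)


def bell_shaped_q(centre, width):
    """Candidate q: a normal curve evaluated on the grid, normalised to sum to 1."""
    weights = [normal_pdf(h, centre, width) for h in hypotheses]
    total = sum(weights)
    return [w / total for w in weights]


# The "simple family": every grid point as a centre, with a few possible widths.
candidates = [(c, w) for c in hypotheses for w in (0.25, 0.5, 1.0, 2.0)]
best_centre, best_width = min(
    candidates, key=lambda cw: free_energy(bell_shaped_q(*cw)))
best_q = bell_shaped_q(best_centre, best_width)
best_f = free_energy(best_q)

# Check against the exact answer (only feasible because this toy problem is tiny).
evidence = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
kl_to_posterior = sum(qh * math.log(qh / ph)
                      for qh, ph in zip(best_q, posterior) if qh > 0)

print("best candidate: centre", best_centre, "width", best_width)
print("free energy:", round(best_f, 4), " -log p(d):", round(-math.log(evidence), 4))
print("KL to exact posterior:", round(kl_to_posterior, 4))
```

The gap between the winning free energy and -log p(d) is exactly the remaining KL divergence to the posterior, so ranking candidates by free energy ranks them by how closely they approximate the posterior.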
Though it is clear that the free energy expression was constructed with practical considerations in mind, it's interesting that we can derive a similar equation by a simple rearrangement of Bayes' rule, emphasising their relationship:

$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{p(d)}$$

to:

$$p(d) = \frac{p(d \mid h)\, p(h)}{p(h \mid d)}$$

We can then express this in the form of logarithms:

$$-\log p(d) = -\log p(d \mid h) - \log p(h) + \log p(h \mid d)$$

And then, averaging both sides over the posterior p(h | d), in terms of entropies:

$$-\log p(d) = -\sum_{h} p(h \mid d) \log p(h, d) + \sum_{h} p(h \mid d) \log p(h \mid d)$$

This Bayes' rule 'free energy' can also be rearranged in the same way as shown before:

$$-\log p(d) = -\mathbb{E}_{p(h \mid d)}[\log p(d \mid h)] + D_{KL}[\,p(h \mid d) \,\|\, p(h)\,]$$

So what is the Free Energy Principle? (Friston, 2006)

So we have seen Bayes' rule as a method of inference and free energy minimisation as a method that tries to make this process easier. The idea of the free energy principle is that all living organisms must minimise free energy, meaning they are approximating Bayes' rule and physically embody models of the world. We can see the application of free energy to organisms as follows:

$$F = D_{KL}[\,q(\theta ; \Omega) \,\|\, p(\theta \mid y)\,] - \log p(y) \qquad (1)$$

$$F = -\mathbb{E}_{q(\theta ; \Omega)}[\log p(y \mid \theta)] + D_{KL}[\,q(\theta ; \Omega) \,\|\, p(\theta)\,] \qquad (2)$$

Here y refers to a sensory data set that an organism encounters via sensory receptors. θ refers to environmental states out in the world beyond the boundaries of the organism, as described by physics and the natural sciences. Ω refers to the physical states the organism occupies and would include things like its morphology, physiology and biochemistry, as well as the position of its body parts and therefore its behaviour. q(θ ; Ω) then refers to a conditional probability for environmental states given the physical states of the organism, establishing the organism as a probabilistic model encoding states of its environment. I'm not sure exactly what this entails except that it is meant to apply to any living organism and not just to brain states. It's hard to say more than that there is a mapping between the structure of the organism's physical states and the structure of the environment, even if it is not always explicitly representational in the sense that we might think of brain states.
In a conventionally inferential sense, then, an organism would minimise its free energy concerning a sensory data set by specifically altering its physical structure Ω to reduce the divergence term in equation 1. By reducing the divergence between the environmental states q(θ ; Ω) encoded by its physical structure and the environmental states causing the sensory data set, p(θ | y), it improves its model of the world beyond its boundaries. This is then equivalent to restricting its encoding of environmental states to those where the average probability of the sensory data set given those states is maximised, which we see when minimising the data term on the left of equation 2. Another way of saying it is that the organism restricts the encoded environmental states to those whose predictions are matched by the sensory data set. At the same time, this restriction needs to balance a trade-off with the organism keeping its options as open as possible about the repertoire of environmental states that it encodes, as seen by minimising the KL term on the right of equation 2.

I actually think that free energy minimisation seems an appropriate way of conceptualising things here, since free energy minimisation is about fitting approximations of ideal models. It seems obvious to me that biological systems can only be approximations and that their structure inevitably limits how well they can model the environment. Even for perceptual faculties like vision, a huge amount of information about the environment is hidden from us. Neither does it really seem possible that any organism could ever model and embody the exact physical structure of its environment directly in its biology. However, a key point seems to be that organisms need to adaptively and selectively interact with their environments to survive. Organisms are highly ordered physical systems that can only exist in a restricted number of physical states.
They are therefore vulnerable to decay under the 2nd law of thermodynamics, due to the randomness that tends to characterise physical systems and their interactions. Without mitigating, avoiding or even reversing the consequences of randomness through things like metabolic processes, behaviour and other tools in the biological repertoire (e.g. immune systems), these organisms would cease to exist and dissipate into their surroundings. While metabolic processes and immune systems, among other examples, do this internally, behaviour does this by regulating an organism's interactions with the environment outside of its boundaries. It ensures that, as an open system, an organism's exchanges with its environment are amenable to the preservation, as opposed to the random dispersion, of its structure - i.e. avoiding mechanical damage and extremes of temperature, maintaining an intake of the molecules required to sustain its internal processes, etc. Organisms must therefore selectively sample only the environmental states that can sustain their continued existence.

We can talk about this in terms of minimising the informational entropy of the sensory states on an organism's boundaries. This just means restricting the number of sensory states that occur for an organism, which entails a reduction in the number of environmental states it interacts with. To avoid death, this necessarily means maximising the probability of only those states amenable to its existence. The conventional inferential account must then be altered, with special emphasis on the data term in equation 2. Rather than just changing its physical structure to minimise this term, the organism can use action to self-fulfil the predictions of its encoded states. However, if the organism doesn't optimise its encoded states by changing its physical structure to minimise the KL divergence in equation 1, then its actions will fulfil suboptimal predictions, which will cause it to die.
With successful prediction and action, there is a circular dynamic in that encoded states predict sensory data, which action then self-fulfils, which in turn drives the optimisation of the encoded states. The argument then seems to go that any living organism must look like it is restricting its interactions with the environment to avoid death, and that doing this means its free energy must be minimised, because restricting its interactions through action depends on optimising predictions, which results in free energy minimisation.

Criticism?

Arguably, a criticism is that anyone can look at an organism's actions and describe the biological mechanisms that generate them as embodying some kind of prediction about the environmental states that are the consequences of those actions. In fact, you could probably make the claim about the consequences of any physical system regardless of how well it regulates its states, similar to how you could argue that any physical system is a computer in some sense. You could point to the type of states that characterise an explosion and suggest that the physical states of whatever caused it serve as models which make predictions about explosions. Is this an example of free energy minimisation? I don't know, but it doesn't seem impossible. In that sense it can be hard to assess to what degree free energy is actually driving the free energy principle. I would certainly question that more if it wasn't for the very strong intuition that models are required for the self-regulation that characterises organisms; in fact, some cases of this have been proven, such as with the internal model principle. Since free energy minimisation is equivalent to approximate Bayesian inference, it seems that we can ground the free energy principle as another version of this type of claim, but specifically applied to living things.
To me, this cushions some of the unfalsifiability or triviality that I can see in the free energy principle; but then again, because the free energy principle proceeds by presupposing that organisms are models rather than proving that models are necessary, it seems less fundamental than the internal model principle, which does try to prove this claim. I guess the unfalsifiable way that organisms are offered as models makes it unnecessary to prove the claim, as long as you can produce workable models of living things minimising free energy, while a proof for something as broad as all living things is probably difficult compared to proofs that apply in a specific mathematical context.