PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 4 of 4

E. Bayesian Phylogeny - Go climb a tree

Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.

MrBayes Website: http://mrbayes.csit.fsu.edu/

For some interesting insights into how Bayes Theorem works in real life see The Bayesian Trap [https://youtu.be/R13BD8qKeTg]

Bayesian methods represent a fundamental intellectual breakthrough in solving problems in almost any field. Bayesian methods attack problems by asking the question, "Would we recognize the best answer if we saw it?"

If we know how to answer that question, the most straightforward solution would be to test all possible answers (trees), and see which one is most likely to have generated the observed data. To illustrate, imagine the set of all probabilities as a surface, with each point on the surface representing the probability that a given tree would generate the alignment we see. The best trees, and their variants, will appear as peaks on the surface, and the worst trees and their variants will appear as valleys.



[Figure: a probability surface over the space of possible trees, with peaks corresponding to good trees and valleys to poor ones; from http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html]



Stated formally, this question can be represented as "what is the probability of the model (tree), given the data?" In other words, we wish to find the tree that maximizes the probability of the model, given the specific alignment we observe:


P(model | data)

i.e. what is the probability of the model, given the data we actually see?



If we can calculate this probability, the following algorithm will find the optimal solution:

Calculate P(model | data) for each possible model
Choose the model that maximizes P(model | data)

In this case, the model is the complete tree, and the data are the sequences at the terminal branches of the tree, i.e. the multiple sequence alignment.

Two problems:
1) How do we calculate P(model | data)?
2) There are a lot of possible models!

1) Calculating P(model | data)


Bayes Theorem:

P(Model | Data) = P(Data | Model) × P(Model) / P(Data)

where
P(Model|Data) is the posterior probability of the model, given the data
P(Data|Model) is the probability of the data, given the model
P(Model) is the prior probability of the model (without any knowledge of the data), e.g. if all models are considered equally likely, P(Model) = 1 / (the number of possible models)
P(Data) is the prior probability of the data (without any knowledge of the model)
prior probability - a probability calculated on theoretical grounds, with no knowledge of the experimental data
posterior probability - the probability of a model, recalculated after the experimental data are taken into account
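
To make this concrete, here is a small Python sketch (the three models and their likelihood values are toy numbers of my own, not from the lecture) that applies Bayes Theorem to a set of candidate models with equal priors, then picks the model with the highest posterior, exactly as the algorithm above prescribes:

# Toy example: three candidate models with equal prior probabilities.
likelihoods = {"tree_A": 0.008, "tree_B": 0.002, "tree_C": 0.0004}  # P(Data|Model)
prior = 1 / len(likelihoods)                                        # P(Model)

# P(Data) is the sum of P(Data|Model) * P(Model) over all models.
p_data = sum(lik * prior for lik in likelihoods.values())

# Bayes Theorem gives the posterior probability of each model.
posteriors = {m: lik * prior / p_data for m, lik in likelihoods.items()}
print(posteriors)                           # tree_A ~0.77, tree_B ~0.19, tree_C ~0.04
print(max(posteriors, key=posteriors.get))  # tree_A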


For a phylogenetic tree τi, given alignment X, the posterior probability of the tree is

f(τi | X) = f(X | τi) f(τi) / Σj f(X | τj) f(τj)

numerator - one tree

denominator - all trees (the sum runs over all B(s) possible trees; B(s) is given below)

where
f(X | τi) is the probability that tree τi would generate alignment X, obtained by integrating over all possible branch lengths and substitution parameters:

f(X | τi) = ∫∫ f(X | τi, νi, θ) f(νi, θ) dνi dθ

where
νi is the set of branch lengths for a given tree
θ is the set of substitution parameters (e.g. base substitution rates)

Note: We can't evaluate these integrals directly, so MrBayes uses a Markov chain Monte Carlo (MCMC) method to approximate them.

f(τi) is the prior probability of tree τi, that is, the probability assigned to the tree before the data are considered. This is usually nothing more than the reciprocal of the number of all possible trees (a uniform prior).

The numerator considers a single tree: the likelihood that the tree would generate the alignment, weighted by the tree's prior probability. The denominator is the same quantity summed over all possible trees.
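
The double integral above has no closed form in general, but it can be approximated by sampling. As a minimal sketch of the underlying Monte Carlo idea (the integrand here is a made-up stand-in, not a real phylogenetic likelihood), an integral can be estimated by averaging the integrand at randomly chosen points:

import math
import random

def integrand(v):
    # Made-up stand-in for f(X | tau, v, theta) f(v, theta).
    return math.exp(-v) * math.sin(v) ** 2

# Estimate the integral over v in [0, 10]: average the integrand at
# uniformly random points, then multiply by the width of the interval.
n = 100_000
a, b = 0.0, 10.0
estimate = (b - a) * sum(integrand(random.uniform(a, b)) for _ in range(n)) / n
print(estimate)  # ~0.40 (the exact value over [0, infinity) is 2/5)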

For an unrooted tree with s taxa, the number of possible trees is

B(s) = (2s - 5)! / [2^(s-3) (s - 3)!]

For a rooted tree,

B(s) = (2s - 3)! / [2^(s-2) (s - 2)!]
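To see why exhaustive evaluation is hopeless, the counting formulas above can be coded directly (a minimal Python sketch; the function names are mine):

from math import factorial

def unrooted_trees(s):
    # Number of possible unrooted bifurcating trees for s taxa (s >= 3).
    return factorial(2 * s - 5) // (2 ** (s - 3) * factorial(s - 3))

def rooted_trees(s):
    # Number of possible rooted bifurcating trees for s taxa (s >= 2).
    return factorial(2 * s - 3) // (2 ** (s - 2) * factorial(s - 2))

for s in (5, 10, 20):
    print(s, unrooted_trees(s), rooted_trees(s))
# 5 15 105
# 10 2027025 34459425
# 20 221643095476699771875 8200794532637891559375

Even at 20 taxa there are already over 10^20 unrooted trees, so evaluating every one is out of the question.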
 

2) Markov Chain Monte Carlo (MCMC) Search

As mentioned above, it is not possible to visit all possible trees, with all possible branch lengths and substitution parameters, and calculate the posterior probability of each. Instead, we can sample the solution space. If we take enough samples, we will find an answer that is close to the optimal answer. The MCMC algorithm works as follows:

make a random solution N1 the current solution
repeat
    pick another solution N2
    if Likelihood(N2) > Likelihood(N1) then
        replace N1 with N2
    else
        with probability Likelihood(N2) / Likelihood(N1),
            replace N1 with N2
    sample (record) the current solution
until (avg. standard dev. of split frequencies between runs < 0.01)

(adapted from Introduction to Bayesian phylogenetics [http://www.agapow.net/teaching/phylogenetics-kew-2003])
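
Below is a minimal runnable sketch of this accept/reject loop in Python. The one-dimensional two-peaked "likelihood" surface stands in for a real tree likelihood, and all of the names are illustrative assumptions, not MrBayes code:

import math
import random

def likelihood(x):
    # Toy two-peaked surface standing in for P(data | tree).
    return math.exp(-(x - 2) ** 2) + 0.5 * math.exp(-(x + 2) ** 2)

def mcmc(n_steps=100_000, step_size=0.5):
    current = random.uniform(-5, 5)        # random starting solution (N1)
    samples = []
    for _ in range(n_steps):
        candidate = current + random.gauss(0, step_size)  # pick another solution (N2)
        ratio = likelihood(candidate) / likelihood(current)
        # Accept if better; if worse, accept with probability = ratio.
        if ratio > 1 or random.random() < ratio:
            current = candidate
        samples.append(current)            # sample (record) the current solution
    return samples

samples = mcmc()
# The chain spends time near each peak in proportion to its probability,
# so roughly two thirds of the samples sit near the higher peak at x = 2.
print(sum(1 for x in samples if x > 0) / len(samples))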

The MCMC algorithm is referred to as a "hill climbing" algorithm, meaning that as it proceeds through many cycles (often millions of cycles) it tends to keep the best answers and discard less likely answers. After a number of initial "burn in" cycles, the algorithm locates all of the major peaks. The occasional acceptance of a worse solution (the else branch above) ensures that the algorithm doesn't get stuck on any one peak in the solution space. Most importantly, the algorithm visits points in the solution space based on their posterior probabilities. That is, it spends the most time in those areas of highest probability (i.e. the peaks).

The tricky part is knowing when to stop. The rule of thumb is that when the average standard deviation of split frequencies between independent runs falls below 0.01, the algorithm has probably converged on a good solution.

As an extra precaution, MrBayes usually runs several independent analyses simultaneously, both to check that they converge on the same answer and to ensure that starting conditions do not in some way bias the process.
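
A hedged sketch of how such a convergence diagnostic could be computed, comparing the frequency with which each split (grouping of taxa) appears in the trees sampled by two independent runs (the splits and frequencies below are invented for illustration):

import statistics

def avg_std_dev_split_freqs(run1, run2):
    # Average, over all splits seen in either run, of the standard
    # deviation of each split's sampling frequency across the two runs.
    splits = set(run1) | set(run2)
    devs = [statistics.stdev([run1.get(s, 0.0), run2.get(s, 0.0)])
            for s in splits]
    return sum(devs) / len(devs)

# Hypothetical split frequencies (split -> fraction of sampled trees).
run1 = {"AB|CD": 0.91, "AC|BD": 0.06, "AD|BC": 0.03}
run2 = {"AB|CD": 0.93, "AC|BD": 0.05, "AD|BC": 0.02}
print(avg_std_dev_split_freqs(run1, run2))  # ~0.0094, below the 0.01 cutoff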

Difference from ML:
  • ML tests the sample space systematically and thoroughly, e.g. all topologies, with iteration of branch lengths. This is why ML takes so long.
  • Bayes tests trees sampled at random from the sample space. Concentrating its sampling on the most promising regions of the sample space allows it to converge on a solution quickly.

Advantages:
  • scales in linear time
  • gives rigorous probability estimate for any tree
  • finds the best set of substitution parameters, so these don't need to be known in advance
  • gives a tree close to the optimal tree

Disadvantages:
  • NOT for the beginner!
  • much more complex to use than simple methods such as NJ, parsimony, or maximum likelihood
