last  page PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 4 of 4
next page

D. Bayesian Phylogeny - Go climb a tree

Huelsenback JP, Ronquist J (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754-755.

MrBayes Website: http://mrbayes.csit.fsu.edu/

Bayesian methods represent a fundamental intellectual breakthrough to solving problems in almost any field. Bayesian methods attack problems by asking the question, "Would we recognize the best answer if we saw it?".

If we know how to answer that question, the most straightforward solution would be to test all possible answers (trees), and see which one was the most likely one to be generated by the observed data. To illustrate, imagine the set of all probabilities as a surface, each point on the surface representing the probability that a given tree would generate the alignment we see. The best trees, and their variants, will appear as peaks on the surface, and the worst trees and their variants will appear as valleys. 



from http://artedi.ebc.uu.se/course/X3-2004/Phylogeny/Phylogeny-TreeSearch/Phylogeny-Search.html


If you just keep sampling the tree space at random, eventually, you will find all the peaks and valleys. This first step gets you close to the "best" tree. The second step is to do a more thorough search on the best peaks, until the algorithm converges upon a tree that is close to the optimal answer. There is no guarantee of finding the optimal answer, but the answer obtained will be very close. The more times you sample the solution space, the greater the chance of finding the tree that best explains the alignment.

Stated formally this question can be represented as "what is the probability of the model (tree), given the data. In other words, we wish to find the tree which maximizes the probability of seeing that specific alignment, given the data:


 P(model | data)

 ie. what is the probability
that the model would
generate the data we actually see?



If we can calculate this probability, the following algorithm will find the optimal solution:

Calculate P(model | data) for each possible model
Choose the model that maximizes P(model | data)

In this case, the model is the complete tree, and the data is the terminal branches of the tree ie. the multiple sequence alignment.

Two problems:
1) How do we calculate P(model | data)?
2) There are a lot of possible models!

1) Calculating P(model | data)


Bayes Theorem:

where
P(Model|Data) is the posterior probability of the model, given the data
P(Data|Model) is the probability of the data, given the model
P(Model) is the prior probability of the model (without any knowledge of the data). eg, if all models are equally valid, P(model) = 1/the number of possible models
P(Data) is the prior probability of the data (without any knowledge of the model)
prior probability - the probability calculated on theoretical grounds, with no knowledge of the experimental data
posterior probability - the probability of seeing a particular result, calculated from the experimental data.


For a phylogenetic tree  given alignment X, the posterior probability of the tree is



numerator - one tree


denominator - all trees

where
f(X|)  is the probability that tree would generate alignment X



where
is the set of branch lengths for a given tree
is the set of substitution parameters (eg. base substitution)

Note: We can't directly evaluate these integrals, so MrBayes uses a Markov Chain Monte Carlo  method as an approximation.

f() is the prior probability of tree , that is, the probability of all of the mutational events going from the ancestral node of the tree, to give all of the observed nodes. This is usually nothing more than the reciprocal of the number of all possible trees.

The numerator is the likelihood of an alignment. The denominator is the sum of prior probabilities for all possible trees.

For an unrooted tree with s branches,


For a rooted tree,


 

2) Markov Chain Monte Carlo (MCMC) Search

As mentioned above, it is not possible to visit all possible trees, with all possible branch lengths and substitution parameters, and calculate the posterior probability . We can sample the solution space. If we take enough samples, we will find an answer that is close to the optimal answer. The  MCMC algorithm works as follows:

make a random solution N1 the current solution
repeat
pick another solution N2
if Likelihood (N1 < N2) then
replace N1 with N2
else
if Random (Likelihood (N2) / Likelihood (N1)) then
replace N1 with N2
sample (record) the current solution
until (avg. standard dev. of probabilities < 0.01)

(from Introduction to Bayesian phylogenetics [http://www.agapow.net/teaching/phylogenetics-kew-2003])

The MCMC algorithm is referred to as a "hill climbing" algorithm, meaning that as it proceeds through many cycles (often millions of cycles) it tends to record the best answers and discard less likely answers. After a number of initial "burn in" cycles,  the algorithm locates all of the major peaks. Step 4 ensures that the algorithm doesn't get stuck on any one peak in the solution space. Most importantly, the algorithm visits points on the solution space based on their posterior probabilities. That is, it spends the most time in those areas of highest probability (ie. peaks).

The tricky part is knowing when to stop. The rule of thumb is, when the average standard deviation of probabilities becomes less than 0.01, the algorithm has probably converged upon a good solution.

As an extra precaution, MrBayes usually runs several independent analytical chains simultaneously, to ensure that starting conditions do not in some way bias the process.

Advantages:
  • scales in linear time
  • gives rigorous probability estimate for any tree
  • finds the best set of substitution parameters, so these don't need to be known in advance
  • gives a tree close to the optimal tree

Disadvantages:
  • NOT for the beginner!
  • much more complex to use than simple methods such as NJ, parsimony, or maximum likelihood

last  page PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 4 of 4
next page