PLNT4610/PLNT7690 Bioinformatics - Lecture 7, part 4 of 4

If we can calculate this probability, the following algorithm will find the optimal solution:

Calculate P(model | data) for each possible model
Choose the model that maximizes P(model | data)

In this case, the model is the complete tree, and the data is the terminal branches of the tree ie. the multiple sequence alignment.

Two problems:

1) How do we calculate P(model | data)?
2) There are a lot of possible models!

1) Calculating P(model | data)

Bayes Theorem:

where

P(Model|Data) is the posterior probability of the model, given the data
P(Data|Model) is the probability of the data, given the model
P(Model) is the prior probability of the model (without any knowledge of the data). eg, if all models are equally valid, P(model) = 1/the number of possible models
P(Data) is the prior probability of the data (without any knowledge of the model)

prior probability - the probability calculated on theoretical grounds, with no knowledge of the experimental data
posterior probability - the probability of seeing a particular result, calculated from the experimental data.

For a phylogenetic tree

given alignment X, the posterior probability of the tree is

numerator - one tree

denominator - all trees

where

f(X|

) is the probability that tree

would generate alignment X

where

is the set of branch lengths for a given tree

is the set of substitution parameters (eg. base substitution)

Note: We can't directly evaluate these integrals, so MrBayes uses a Markov Chain Monte Carlo method as an approximation.

) is the prior probability of tree

, that is, the probability of all of the mutational events going from the ancestral node of the tree, to give all of the observed nodes. This is usually nothing more than the reciprocal of the number of all possible trees.

The numerator is the likelihood of an alignment. The denominator is the sum of prior probabilities for all possible trees.

For an unrooted tree with s branches,

For a rooted tree,

2) Markov Chain Monte Carlo (MCMC) Search

As mentioned above, it is not possible to visit all possible trees, with all possible branch lengths and substitution parameters, and calculate the posterior probability

. We can sample the solution space. If we take enough samples, we will find an answer that is close to the optimal answer. The MCMC algorithm works as follows:

make a random solution N1 the current solution
repeat

pick another solution N2
if Likelihood (N1 < N2) then

replace N1 with N2

else

if Random (Likelihood (N2) / Likelihood (N1)) then

replace N1 with N2

sample (record) the current solution

until (avg. standard dev. of probabilities < 0.01)

(from Introduction to Bayesian phylogenetics [http://www.agapow.net/teaching/phylogenetics-kew-2003])

The MCMC algorithm is referred to as a "hill climbing" algorithm, meaning that as it proceeds through many cycles (often millions of cycles) it tends to record the best answers and discard less likely answers. After a number of initial "burn in" cycles, the algorithm locates all of the major peaks. Step 4 ensures that the algorithm doesn't get stuck on any one peak in the solution space. Most importantly, the algorithm visits points on the solution space based on their posterior probabilities. That is, it spends the most time in those areas of highest probability (ie. peaks).

The tricky part is knowing when to stop. The rule of thumb is, when the average standard deviation of probabilities becomes less than 0.01, the algorithm has probably converged upon a good solution.

As an extra precaution, MrBayes usually runs several independent analytical chains simultaneously, to ensure that starting conditions do not in some way bias the process.

Difference from ML:

ML tests the sample space systematically and thoroughally eg. all topologies with iteration of branch lengths. This is why ML takes so long
Bayes tests trees sampled at random from the sample space. Iteration of the most promising branches within the sample space allows it to quickly converge upon a solution.

Advantages:

scales in linear time
gives rigorous probability estimate for any tree
finds the best set of substitution parameters, so these don't need to be known in advance
gives a tree close to the optimal tree

Disadvantages:

NOT for the beginner!
much more complex to use than simple methods such as NJ, parsimony, or maximum likelihood

last page

PLNT4610/PLNT7690 Bioinformatics
Lecture 7, part 4 of 4

E. Bayesian Phylogeny - Go climb a tree

1) Calculating P(model | data)

2) Markov Chain Monte Carlo (MCMC) Search