Guided Training

(from known issues with DART)

As has been noted by others, Expectation Maximization (as used by the xrate program) is not the most reliable of algorithms, and can get stuck in local optima. MCMC is of course the right solution to this, for all sorts of reasons (accurate estimates of error being chief amongst them). However, if you're desperately in need of a quick maximum likelihood answer, there are some tricks you can use to avoid local maxima.

Guided training, as described in Holmes & Rubin (2002), "An expectation maximization algorithm for training hidden substitution models", J. Mol. Biol. 317:753-64, is one such trick. The basic idea is to start by training a restricted parameterisation of the model you are interested in. Once that is trained, you incrementally add more parameters to the model and re-train, using each trained model as the seed for the next.
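The staged procedure can be sketched generically as follows. This is a minimal illustration and not xrate's implementation: `train` stands in for any local optimiser (such as an EM run), `embed` maps the previous stage's parameters into the next, larger parameter space, and every name here (`guided_training`, `hill_climb`, `full_objective`) is hypothetical. The toy objective has a single optimum, so the example only demonstrates the seeding mechanics, not an escape from a local optimum.

```python
def guided_training(stages, initial_params, train):
    """Train a sequence of nested models, seeding each with the previous fit.

    stages: list of (embed, objective) pairs, most restricted model first.
            embed(params) lifts the previous fit into this stage's parameter space.
    train:  any local optimiser, called as train(objective, seed) -> params.
    """
    params = initial_params
    for embed, objective in stages:
        seed = embed(params)
        params = train(objective, seed)
    return params


def hill_climb(objective, seed, step=0.5, min_step=1e-6):
    """Toy stand-in for EM: coordinate-wise search that halves the step when stuck."""
    params = list(seed)
    best = objective(params)
    while step > min_step:
        improved = False
        for k in range(len(params)):
            for delta in (step, -step):
                trial = list(params)
                trial[k] += delta
                score = objective(trial)
                if score > best:
                    params, best, improved = trial, score, True
        if not improved:
            step /= 2
    return params


def full_objective(p):
    """Made-up two-parameter score (higher is better), maximised at (1, 2)."""
    return -((p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2)


stages = [
    # Stage 1: restricted model with both parameters tied to one value.
    (lambda p: list(p), lambda p: full_objective([p[0], p[0]])),
    # Stage 2: full model, seeded by copying the tied value into both slots.
    (lambda p: [p[0], p[0]], full_objective),
]

fit = guided_training(stages, [0.0], hill_climb)
print(fit)
```

The important design point is that `embed` must map a restricted fit to a point in the larger model with the same likelihood, so that re-training can only improve on the seed.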

For example, suppose you are interested in estimating a general irreversible rate matrix. You might start by estimating a matrix of the form R_{ij} = \kappa \pi_j for i \neq j (called the "rind" model in xgram), where the off-diagonal rates in each column are constrained to be identical. Next, you would estimate a general reversible model using the previous "rind" model as a seed (note that reversible models satisfy the constraint \pi_i R_{ij} = \pi_j R_{ji}, which is slightly weaker than the "rind" constraint, so every "rind" matrix is also a valid reversible matrix). Finally, you would use this reversible matrix as a seed to estimate the general irreversible model (unconstrained except for the general rate matrix constraints \sum_j R_{ij} = 0 and \sum_i \pi_i = 1).
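The nesting of these three model classes can be checked numerically. The sketch below is not xrate code: it builds a "rind" matrix for a hypothetical four-state alphabet with made-up values of \kappa and \pi, and confirms that it already satisfies both the reversibility constraint and the general rate matrix constraint, which is what makes it a valid seed for the later stages. The function names are illustrative assumptions.

```python
def rind_matrix(kappa, pi):
    """R[i][j] = kappa * pi[j] for i != j; each diagonal entry makes its row sum to zero."""
    n = len(pi)
    R = [[kappa * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        R[i][i] = -sum(R[i])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return R


def is_rate_matrix(R, tol=1e-12):
    """General constraint: rows sum to zero and off-diagonal rates are non-negative."""
    n = len(R)
    rows_ok = all(abs(sum(row)) < tol for row in R)
    offdiag_ok = all(R[i][j] >= 0 for i in range(n) for j in range(n) if i != j)
    return rows_ok and offdiag_ok


def is_reversible(R, pi, tol=1e-12):
    """Detailed balance: pi[i] * R[i][j] == pi[j] * R[j][i] for all i, j."""
    n = len(pi)
    return all(
        abs(pi[i] * R[i][j] - pi[j] * R[j][i]) < tol
        for i in range(n)
        for j in range(n)
    )


pi = [0.1, 0.2, 0.3, 0.4]  # made-up equilibrium frequencies
kappa = 2.0                # made-up overall rate

R = rind_matrix(kappa, pi)
print("rind:", is_rate_matrix(R), is_reversible(R, pi))

# Perturb one rate (keeping the row sum at zero) to leave the reversible class.
R_irrev = [row[:] for row in R]
R_irrev[0][1] += 0.5
R_irrev[0][0] -= 0.5
print("perturbed:", is_rate_matrix(R_irrev), is_reversible(R_irrev, pi))
```

The perturbed matrix is still a legal rate matrix but no longer satisfies detailed balance, which is the kind of extra freedom the final, irreversible stage is allowed to explore.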

-- Ian Holmes - 23 Mar 2007