# Corrections to "Using guide trees to construct multiple-sequence evolutionary HMMs" (Holmes, 2003)

This page will be used to document corrections to the algorithms published in the following paper:

Currently there is only one correction.

Additional documentation of transducer jargon can be found on the phylocomposer page.

## Collapsed EHMM transition probability

The correction applies to the following portion of the manuscript as published:

• Final paragraph, Section "THE COLLAPSED EHMM" (second column, page i154)
• "For the collapsed transition probability $t(\phi,\phi')$ to be nonzero, it is required [...] or back up (to an ancestor)."

The corrected version, shown below, has been implemented in the DART source code for some time, but has been undocumented save for a few obscure comments in one source file (DartSrc:handel/inline_ehmm.h, constructor for class CollapsedETrans).

The collapsed transition probability $t(\phi,\phi')$ where $\phi$ is the source state and $\phi'$ the destination state can be written as a product of factors:

$t(\phi,\phi') = WXYZ$

The meaning of these factors is as follows.

---

First, recall the following definitions from the original paper (Holmes, 2003):

• Tree nodes are numbered 1,2,3...N and are sorted in preorder
• The "root branch" from node 1 to node 2 must exist, and must be the only branch from node 1
• $\bf D$ is the set of all descendants of node $n$
• Branches are numbered by the node they lead to, i.e. 2,3...N
• $T_n$ is the "branch length" leading to node $n$, so e.g. $T_2$ is the root branch length
• A "branch transducer" is associated with every branch
• $\tau(\psi,\psi';T)$ is the transition probability from $\psi$ to $\psi'$ for a branch transducer with branch length $T$
• $\Psi$ is the state space of the branch transducer (and is assumed to be independent of branch length)
• $\chi(\psi)$ is the state type of $\psi$ for $\psi \in \Psi$, and takes values in ${\tt S}, {\tt E}, {\tt W}, \stackrel{xy}{{\tt M}}, \stackrel{x}{{\tt D}}, \stackrel{y}{{\tt I}} \right\$

Note that "branch length" can be a continuous real variable, but it doesn't have to be; in fact it can be any label associated with the branch.

Also recall, for any Collapsed EHMM state $\phi$, that:

• $\psi_n$ is the state of the transducer on the branch leading to node $n$
• $\chi_n \equiv \chi(\psi_n)$ is the state type of $\psi_n$
• $\beta(\phi)$ is the emitter node of $\phi$, i.e. the highest-numbered Insert node:

$2,\ \ n : \chi_n = \stackrel{y}{\tt I} \right\$

We introduce some new definitions:

• $\gamma(\phi)$ is the first node that is "eclipsed" by the emitter in state $\phi$, or N+1 if none are:

$N+1,\ \ n : n > \beta(\phi), n \notin {\bf D}(\beta(\phi)) \$

• $\upsilon(\psi,\psi';T)$ is the effective transition probability via intermediate wait states:

$\psi'' \in \Psi:\chi(\psi'') = {\tt W}$

• $m = \beta(\phi')$ is a shorthand for the destination emitter.

---

The algorithm for deciding whether a transition is allowed is as follows. The composite transducer state is first partitioned into subtrees as in the following table. For every node $n$ within each subtree, the appropriate condition must be satisfied if the transition is to be allowed:

 Subtree Condition Term in $t(\phi,\phi')$ $2 \leq n < \beta(\phi')$ $\psi_n = \psi'_n$ $W$ $n = \beta(\phi')$ $\tau(\psi_n,\psi'_n;T_n) > 0$ $X$ $\beta(\phi') < n < \gamma(\phi')$ $\tt W$ $Y$ $\gamma(\phi') \leq n < \gamma(\phi)$ $\tt W$ $Z$ $\max\left(\gamma(\phi),\gamma(\phi')\right) \leq n \leq N$ $\psi_n = \psi'_n$ $W$

A description of factors $W,X,Y,Z$ follows.

---

Factor $W$ prevents transducers from changing state if they are prior to the destination emitter, or eclipsed by both source and destination emitters:

$\displaystyle W = \left( \prod_{n=2}^{m-1} \delta(\psi_n = \psi'_n) \right) \times \left( \prod_{n=\max\left(\gamma(\phi),\gamma'(\phi)\right)}^N \delta(\psi_n = \psi'_n) \right)$

---

Factor $X$ corresponds to the change of state of the transducer at the destination emitter:

$X = \tau (\psi_m, \psi'_m ;T_m)$

---

Factor $Y$ collects contributions from descendants of the destination emitter:

$\displaystyle Y = \prod_{n = m + 1}^{\gamma(\phi') - 1} y_n$

where

$\begin{array}{rl} \upsilon(\psi_n,\psi'_n;T_n) & \mbox{if \chi_n \neq {\tt W} or \chi'_n \neq {\tt W}} } \\ \delta(\psi_n = \psi'_n) & \mbox{if \chi_n = \chi'_n = {\tt W}} \end{array$

These transitions can use intermediate Wait states, which is why we use $\upsilon$ rather than $\tau$.

---

Factor $Z$ collects contributions from newly eclipsed transducers:

$\displaystyle Z = \prod_{n = \gamma(\phi')}^{\gamma(\phi) - 1} z_n$

where

$\begin{array}{rl} \tau(\psi_n,\psi'_n;T_n) & \mbox{if \chi_n \neq {\tt W}} } \\ \delta(\psi_n = \psi'_n) & \mbox{if \chi_n = {\tt W}} \end{array$

It is trivial to generalize the above formalism so that the state space $\Psi$ of the branch transducer can depend on the branch label $T_n$, as in the phylocomposer program.

Ian Holmes - 07 Jun 2007