String Transducers
Contents
String transducers
Introduction
String transducers---a.k.a. sequence transducers---are finite-state machines (that is to say, automata) with an input tape and an output tape
(in the case of a two-tape transducer), or an input tape
and multiple output tapes
(in the general, 1-input N-output case).
A two-tape transducer, with one input and one output, is similar to a Pair HMM; the major difference is that the probabilistic weights are normalized conditional on the input sequence.
That is, the transition & emission parameters are such that the Forward algorithm, given and
, computes the conditional probability
.
In contrast, the Forward algorithm for a Pair HMM computes the joint probability
.
Since is the probability that the transducer converts input
to output
, thus
for any
.
A typical way to guarantee that a transducer meets this criterion for all possible
is to normalize every emission (and transition) conditional on the input symbol (and input sequence length).
As with Pair HMMs, string transducers are decent models for substitutions, indels and similar "local" mutations. However, they can also be systematically combined to yield probabilistic models for multiple, phylogenetically-related sequences. This algorithm, Transducer Composition, makes them great tools for statistical alignment, for analyzing sequence evolution on trees and for phylogenomic modeling & reconstruction.
Other names for transducers
Alternate names for string transducers include:
- Wikipedia:Finite_state_transducer;
- Input/Output Automata, Alignment Automata, Stochastic Automata;
- Moore Machines, Mealy Machines;
- "Conditional HMMs", meaning conditionally-normalized Pair HMMs, Multiple HMMs, Evolutionary HMMs or Phylo-HMMs.
We use these terms more-or-less interchangeably, although there are precise and subtly different definitions in common usage for all of them (with well-defined relationships between the various definitions; e.g. we can assert that
"the general Phylo-HMM implemented by xrate is just a special-case Evolutionary HMM without indels: a singleton transducer composed with a phylogenetic composition of branch transducers, none of which have insert or delete states"
and while this may sound cryptic, it is a well-formed and easily provable statement in transducer theory).
Spritually similar models
A number of models and inference algorithms in the molecular evolution algorithms literature are representable as transducers. Some are almost exactly the same thing, e.g.
- Diallo et al.: Exact and heuristic algorithms for the Indel Maximum Likelihood Problem. J. Comput. Biol. 2007;14:446-61.
- Kim & Sinha: Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics 2007;23:289-97.
Others are slightly more restrictive; like the TKF91 and TKF92 models, which do not allow indels to overlap
- Thorne et al.: Inching toward reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 1992;34:3-16.
Substitution-only models are also (trivially) representable as transducers, of course; as, in the same way, are the various models that use a finite-state continuous time Markov chain over N alphabet symbols and an extra "gap" symbol, sometimes with corrections that approximate transducer models.
The transducers we use here are finite in their state space, unlike for example the Long Indel Model which introduces realistic gap length distributions to the TKF model, at the cost of increasing the alignment complexity from that of a finite-state pair HMM (or transducer) to that of a generalized pair HMM with arbitrary distributions over emission lengths.
Generalizations of transducers
The Lexicalized Transducer is useful for modeling context-sensitive substitutions, indels and even local micro-duplications and micro-satellite dynamics.
The Parse Tree Transducer can be used for modeling the evolution of "parse trees" (e.g. noncoding RNA gene structures), as well as strings.
It is also possible to generalize the idea to a transducer with multiple inputs as well as outputs
, to model recombination events and other exchanges of genetic material.
Pages on biowiki
There are several pages on this wiki about transducers applied to bioinformatics:
- Phylo Film: animations of transducers, Phylo Grammars and other evolutionary models
- String Transducers
- Software packages
- Phylogenetic Alignment Reader (long-ish bibliography)
Pages on Wikipedia
Wikipedia has some good background on finite-state transducers as used in computer science:
References
- Computational linguistics
- Bioinformatics
- David Searls & Kevin Murphy
- Searls & Murphy: Automata-theoretic models of mutation and alignment. Proc Int Conf Intell Syst Mol Biol 1995;3:341-9.
- http://www.cs.ubc.ca/~murphyk/Papers/ismb95.pdf
- This is an amazingly far-sighted paper with examples including transducers that model frame-sensitive deletions (c.f. Reading Frame Conservation)
- Holmes et al (see also paper archive)
- An alignment-free generalization to indels of Felsenstein's phylogenetic pruning algorithm. Oscar Westesson, Gerton Lunter, Benedict Paten, Ian Holmes (Submitted on 22 Mar 2011)
- Holmes &: Phylocomposer and phylodirector: analysis and visualization of transducer indel models. Bioinformatics 2007;23:3263-4.
- Bradley & Holmes: Transducers: an emerging probabilistic framework for modeling indels on trees. Bioinformatics 2007;23:3258-62.
- Holmes &: Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics 2003;19 Suppl 1:i147-57. (PDF) (errata)
- Miklós et al.: A "Long Indel" model for evolutionary sequence alignment. Mol. Biol. Evol. 2004;21:529-40. (PDF)
- Holmes & Bruno: Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics 2001;17:803-20. (PDF)
- Statistical alignment literature
- David Searls & Kevin Murphy
- People interested in and/or developing this stuff (please add yourself)...
- Oscar Westesson
- Robert Bradley
- Gerton Lunter
- Istvan Miklos
- Aaron Mackey
- Jotun Hein
- Alexei Drummond
- Adam Novak
- Benedict Paten
- Dan Klein
- Krishna Roskin
- Rahul Satija
- Lior Pachter
- David Haussler
- Robin Dowell
- Marc Suchard
- David Ardell
- Abdoulaye Banirй Diallo