Xrate Grammars
XRate grammar files
This page describes the repository of XRATE grammar files included in the DART software package.
See xrate format for a description of the file format.
Example xrate grammar files
The dart/grammars subdirectory includes many example grammars for DNA, protein and RNA sequences.
Here are a few examples of working xrate grammar files. The techniques illustrated here can be mixed and matched. Some of the grammars use xrate macros, a small Lisp-like dialect for specifying repetitively structured grammars.
Point substitution models
Grammars that implement point substitution models have two (almost trivial) rules: S -> X S, where S is a nonterminal and X is an alignment column, and S -> End. Each emitted alignment column is generated on a phylogeny (which can be specified in the input Stockholm format alignment file, or will otherwise be estimated from that alignment) using a substitution rate matrix (which is specified as part of the grammar). In xrate format jargon, the symbol X is called a pseudoterminal.
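To make the role of the pseudoterminal concrete, here is a minimal Python sketch of how one alignment column can be scored on a phylogeny, using Felsenstein's pruning recursion with the Jukes-Cantor model. It is an illustration only, not xrate's code: the tree encoding, the function names and the toy alignment are all hypothetical, gaps are simply treated as missing data, and the probabilities attached to the two grammar rules are ignored.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b over a branch
    of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial(node, column):
    """Pruning step: map each base x to P(leaves below node | x at node).

    Hypothetical tree encoding: a leaf is a sequence name (str); an
    internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        observed = column[node]
        if observed == "-":  # assumption: a gap is treated as missing data
            return {x: 1.0 for x in BASES}
        return {x: 1.0 if x == observed else 0.0 for x in BASES}
    like = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, column)
        for x in BASES:
            like[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return like

def column_likelihood(tree, column):
    """Likelihood of one emitted column X, with uniform root frequencies."""
    root = partial(tree, column)
    return sum(0.25 * root[x] for x in BASES)

def alignment_log_likelihood(tree, alignment):
    """Sum of column log-likelihoods: one application of S -> X S per column."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += math.log(column_likelihood(tree, column))
    return total

# Toy inputs, invented for illustration.
tree = [("human", 0.1), ([("mouse", 0.2), ("rat", 0.2)], 0.1)]
alignment = {"human": "ACGT", "mouse": "ACGA", "rat": "ACG-"}
print(alignment_log_likelihood(tree, alignment))
```

Because every column is emitted by the same rule, the alignment score factorises into independent column terms; the grammars further down this page relax exactly that assumption.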
These grammar files, then, effectively just illustrate the file format for the substitution rate matrix and the notational principle of tying rate matrices to grammars using pseudoterminals:
- Classic low-dimensional models of point substitution
- jukescantor.eg -- Jukes and Cantor's 1969 model (uniform base frequencies, single substitution rate)
- kimura2.eg -- Kimura's 1980 two-parameter model (transition/transversion bias)
- fels81.eg -- Felsenstein's 1981 model (non-uniform base frequencies)
- hky85.eg -- The HKY85 model (transition/transversion bias and non-uniform base frequencies)
- rev.eg -- General reversible model (DNA bases)
- irrev.eg -- General irreversible model (DNA bases)
- nullprot.eg -- General reversible model (amino acids)
- sn.eg -- Rough approximation to CodeML's f4x3 model (codon model with site-specific nucleotide frequencies, transition/transversion ratio and synonymous/nonsynonymous rates)
- See also tips for codon matrices
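The classic low-dimensional DNA models listed above differ only in how the 4x4 rate matrix is parameterised. As a rough sketch (plain Python, not xrate format; the function and variable names are hypothetical and the matrix is left unnormalised), HKY85 can be written once and the Jukes-Cantor, Kimura two-parameter and Felsenstein 1981 models recovered as special cases:

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

def hky85(kappa, freqs):
    """Build an HKY85 rate matrix as a dict of dicts.

    q[a][b] is the rate from base a to base b: the target frequency,
    multiplied by kappa for transitions. Diagonals make rows sum to zero.
    """
    q = {a: {} for a in BASES}
    for a in BASES:
        for b in BASES:
            if a != b:
                q[a][b] = freqs[b] * (kappa if is_transition(a, b) else 1.0)
        q[a][a] = -sum(q[a].values())
    return q

uniform = {b: 0.25 for b in BASES}
skewed = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # illustrative values

jukes_cantor = hky85(1.0, uniform)  # one rate, uniform frequencies
kimura2 = hky85(2.0, uniform)       # transition/transversion bias only
fels81 = hky85(1.0, skewed)         # non-uniform frequencies only
hky = hky85(2.0, skewed)            # both effects together

for a in BASES:
    print(a, [round(hky[a][b], 3) for b in BASES])
```

The general reversible and irreversible models replace the single kappa with a free rate for each pair of bases (or each ordered pair, in the irreversible case).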
The above xrate files illustrate the idea of a basic point substitution model. The following xrate files combine several such models, using a grammar to describe how different substitution models are used for different alignment columns.
Feature predictors
- Protein grammars
- nullprot.eg -- the general reversible model for amino acids
- prot3.eg -- 3-state protein phylo-HMM a la Thorne, Goldman & Jones
- RNA folding grammars (following Hein, Knudsen et al)
- pfold.eg -- RNA folding (see also Knudsen & Hein)
- codon.eg -- empirical codon model
- dinuc.eg -- context-dependent substitution process, e.g. CpG avoidance
- RNA gene prediction grammars (following Jakob Skou Pedersen, Irmtraud Meyer et al)
- ncRnaDualStrand.eg -- dual-strand gene-predicting ncRNA grammar similar to EvoFold
- rnadecoder.eg -- overlapping structure/ORF grammar based on RNA-decoder. Makes heavy use of macros
- XDecoder.eg -- an improved version of the overlapping structure/ORF grammar, by Oscar Westesson
Lineage-specific evolutionary grammars
- Lineage-specific phylo-grammars, following Adam Siepel, David Haussler et al
- conservation_phylohmm.eg -- phylo-HMM for detecting regions of high conservation. Makes use of iteration macros. Inspired by PhastCons
- rescaled_branch_phylohmm.eg -- phylo-HMM for detecting regions where one branch of the tree has been rescaled; uses tree iterations. Inspired by DLESS
- ancestral_gc.eg -- model for measuring lineage-specific GC content using tree iterations
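The phylo-HMMs listed above assign each alignment column to a hidden state (for example "conserved" versus "neutral"), where each state scores the column with its own substitution model or rescaled tree. The sketch below shows the generic forward-backward calculation that turns per-column likelihoods into posterior state probabilities. It is a textbook HMM routine in plain Python, not xrate's implementation; the function name and all numbers are invented for illustration, and the column likelihoods are assumed to have been computed already.

```python
def posterior_states(emissions, trans, start):
    """Scaled forward-backward for a phylo-HMM.

    emissions[i][k]: likelihood of column i under state k
    trans[j][k]:     probability of moving from state j to state k
    start[k]:        probability of starting in state k
    Returns post[i][k] = P(state k at column i | whole alignment).
    """
    n, m = len(emissions), len(start)
    fwd, scale = [], []
    for i in range(n):
        if i == 0:
            row = [start[k] * emissions[0][k] for k in range(m)]
        else:
            prev = fwd[-1]
            row = [emissions[i][k] * sum(prev[j] * trans[j][k] for j in range(m))
                   for k in range(m)]
        total = sum(row)
        fwd.append([x / total for x in row])  # rescale to avoid underflow
        scale.append(total)
    bwd = [[1.0] * m for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = [sum(trans[j][k] * emissions[i + 1][k] * bwd[i + 1][k]
                      for k in range(m)) / scale[i + 1]
                  for j in range(m)]
    return [[fwd[i][k] * bwd[i][k] for k in range(m)] for i in range(n)]

# Invented example: state 0 = "conserved", state 1 = "neutral".
emissions = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.05, 0.6]]
trans = [[0.9, 0.1], [0.1, 0.9]]
start = [0.5, 0.5]

for i, row in enumerate(posterior_states(emissions, trans, start)):
    print(i, [round(p, 3) for p in row])
```

In a lineage-specific model such as the rescaled-branch grammar, the states would differ in the tree (one branch stretched or shrunk) rather than in the rate matrix, but the state-decoding step has the same shape.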
Site-specific models
- Column-by-column substitution models, following e.g. Bruno & Halpern, Eisen & Moses, etc.
- site_specific_protein.eg -- site-specific frequencies for protein substitution models using the iteration macros. Inspired by RIND
- site_specific.eg -- site-specific frequencies for DNA substitution models (only difference is the alphabet)
Grammars that use the Scheme interpreter
- Site-to-site variation models
- autodiscgamma.eg -- autocorrelated discretized-gamma distribution over rates
- Yang: A space-time process model for the evolution of DNA sequences. Genetics 1995;139:993-1005.
- Codon models
- nielsen-yang.eg -- Nielsen-Yang synonymous/nonsynonymous transition/transversion codon model
- Uses the Dart Scheme standard library
- Nielsen & Yang: Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 1998;148:929-36.
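As an illustration of the kind of rate function a synonymous/nonsynonymous transition/transversion codon model defines, here is a short Python sketch. It follows the usual parameterisation (rate proportional to the target codon frequency, multiplied by kappa for transitions and by omega for amino-acid changes, and zero for multi-nucleotide changes or changes involving stop codons). The function name and parameter values are hypothetical, the standard genetic code is assumed, and nielsen-yang.eg may differ in detail, for instance in how omega varies across site classes.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)

def codon_rate(src, dst, kappa, omega, freq):
    """Instantaneous rate from codon src to codon dst (src != dst)."""
    if "*" in (CODE[src], CODE[dst]):
        return 0.0  # stop codons are excluded from the state space
    diffs = [(a, b) for a, b in zip(src, dst) if a != b]
    if len(diffs) != 1:
        return 0.0  # only single-nucleotide changes are allowed
    a, b = diffs[0]
    rate = freq[dst]
    if is_transition(a, b):
        rate *= kappa
    if CODE[src] != CODE[dst]:
        rate *= omega  # nonsynonymous change
    return rate

# Assumption for the example: uniform frequencies over the 61 sense codons.
sense = [c for c in CODE if CODE[c] != "*"]
freq = {c: 1.0 / len(sense) for c in sense}

print(codon_rate("TTT", "TTC", kappa=2.0, omega=0.5, freq=freq))  # synonymous transition
print(codon_rate("TTT", "TTA", kappa=2.0, omega=0.5, freq=freq))  # nonsynonymous transversion
```

Writing out all 61x61 entries by hand is tedious and repetitive, which is presumably why the grammar generates them with the Scheme library instead.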
-- Ian Holmes - 18 Mar 2009