How Much Training Data Do I Need?
"How much training data do I need?"
The above question comes up quite a lot when using xrate. Here's a back-of-envelope calculation to guide such decisions.
The amount of data you need is determined by the "slowest event rate" $\lambda$, i.e. the rate at which the slowest event occurs per site at equilibrium.
If $R_{ij}$ is the rate of mutation from state $i$ to state $j$ and $\pi$ is the equilibrium distribution over states, then the rate of the mutation $i \to j$ at equilibrium is $\pi_i R_{ij}$, so the slowest event rate is $\lambda = \min_{i \neq j} \pi_i R_{ij}$.
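To make this concrete, here is a minimal Python sketch that picks out the slowest event rate. The rate matrix and equilibrium distribution below are made-up illustrative numbers, not taken from any real model:

```python
# Toy 4-state (A, C, G, T) rate matrix; off-diagonal R[i][j] is the
# rate of mutation i -> j.  All values are invented for illustration.
pi = [0.3, 0.2, 0.2, 0.3]                      # equilibrium distribution
R = [[-1.10, 0.30, 0.60, 0.20],
     [ 0.45, -1.30, 0.25, 0.60],
     [ 0.90, 0.25, -1.35, 0.20],
     [ 0.20, 0.40, 0.15, -0.75]]

# Slowest event rate: minimum over i != j of pi_i * R_ij
slowest = min(pi[i] * R[i][j]
              for i in range(4) for j in range(4) if i != j)
print(f"slowest event rate lambda = {slowest:.4f}")   # prints 0.0400
```

Here the slowest event is the G→T mutation, whose at-equilibrium rate $\pi_G R_{GT} = 0.2 \times 0.2 = 0.04$ is smaller than that of any other substitution.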
Suppose that $T$ is the total branch length of the tree, i.e. the total elapsed evolutionary time per site.
Let $N$ be the number of times you want to observe the slowest event, and let $S$ be the number of sites you'd have to train on to observe the slowest event $N$ times.
Then the total amount of evolutionary time represented by your training data is $ST$, and you want $ST\lambda \geq N$, so the number of training sites you need is $S \geq N / (\lambda T)$.
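As a worked example (the $\lambda$ and $T$ values below are assumed for illustration, not from any real dataset):

```python
lam = 0.04   # assumed slowest event rate per site
T = 2.5      # assumed total branch length of the training tree
N = 100      # how many times we want to observe the slowest event

S = N / (lam * T)   # sites needed: S >= N / (lambda * T)
print(f"train on at least {S:.0f} sites")   # prints "train on at least 1000 sites"
```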
(Note that the definition of a "site" depends on your chain: a site could be a single alignment column for a neutral DNA model, three columns for a codon model, or two for an RNA basepair model.)
How big should $N$ be? Assuming an uninformative prior: if you observe $N$ Poisson-distributed events in time $t$, then the posterior distribution of the underlying event rate is a gamma distribution with mean $N/t$ and variance $N/t^2$.
Thus the fractional error, i.e. the ratio of the standard deviation to the mean, is $1/\sqrt{N}$.
For a desired fractional error of $\epsilon$ or less, you need $N \geq 1/\epsilon^2$, so you should therefore train on $S \geq 1/(\epsilon^2 \lambda T)$ sites.
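Putting the pieces together, here is a small table of $N = 1/\epsilon^2$ and $S = 1/(\epsilon^2 \lambda T)$ for a few target fractional errors, again using assumed values of $\lambda$ and $T$:

```python
lam, T = 0.04, 2.5   # assumed slowest event rate and total branch length

for eps in (0.5, 0.2, 0.1, 0.05):
    N = 1.0 / eps ** 2        # events needed for fractional error eps
    S = N / (lam * T)         # corresponding number of training sites
    print(f"eps = {eps:4.2f}: N = {N:5.0f} events, S = {S:7.0f} sites")
```

Note how quickly the requirement grows: halving the target error quadruples the number of sites.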
Of course the above is a circular argument: it assumes you know $\lambda$ ahead of time.
In practice, while you may have some idea of what the slowest event rate will be (based on previous experience and data),
any estimate you might have for $\lambda$ is of order-of-magnitude accuracy at best.
We can extend the above line of reasoning to parametric models (see the xgram format page for info). When evaluating the slowest event rate $\lambda$,
we should allow for parametric chains where multiple mutations share the same rate parameter $p$: every event governed by $p$ contributes information about it, so $p$ is effectively observed at a much higher rate than any single event.
If an event rate $r_{ij}$ is some function of a rate parameter $p$, then $\pi_i (p^2 / r_{ij}) (\partial r_{ij} / \partial p)^2$ gives the effective contribution of $r_{ij}$ to $\lambda_p$, the effective event rate for $p$ (this follows from the Fisher information of the Poisson likelihood; in the common case $r_{ij} \propto p$ it reduces to just $\pi_i r_{ij}$).
A better definition of $\lambda$ is therefore $\lambda = \min_p \lambda_p$, where $\lambda_p = \sum_{i \neq j} \pi_i (p^2 / r_{ij}) (\partial r_{ij} / \partial p)^2$ and the sum runs over the events whose rates depend on $p$.
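A sketch of the pooled calculation in Python, assuming the common case $r_{ij} \propto p$ so that each event simply contributes its at-equilibrium rate $\pi_i r_{ij}$; the parameter names and all rates below are invented for illustration:

```python
pi = [0.3, 0.2, 0.2, 0.3]   # equilibrium distribution over A, C, G, T

# Each off-diagonal event (i, j) -> (shared rate parameter, rate r_ij).
# Two invented parameters, HKY-style: "ts" (transitions), "tv" (transversions).
events = {
    (0, 2): ("ts", 0.60), (2, 0): ("ts", 0.90),   # A<->G transitions
    (1, 3): ("ts", 0.60), (3, 1): ("ts", 0.40),   # C<->T transitions
    (0, 1): ("tv", 0.30), (1, 0): ("tv", 0.45),
    (0, 3): ("tv", 0.20), (3, 0): ("tv", 0.20),
    (1, 2): ("tv", 0.25), (2, 1): ("tv", 0.25),
    (2, 3): ("tv", 0.20), (3, 2): ("tv", 0.15),
}

# Pool the at-equilibrium rates of all events sharing a parameter.
lambda_p = {}
for (i, j), (param, rate) in events.items():
    lambda_p[param] = lambda_p.get(param, 0.0) + pi[i] * rate

lam = min(lambda_p.values())   # slowest *parameter* rate, not slowest event
print(lambda_p, "-> lambda =", round(lam, 4))
```

With these numbers the pooled rates are $\lambda_{ts} = 0.6$ and $\lambda_{tv} = 0.485$, both far larger than the slowest single event rate (0.04), so sharing parameters cuts the required number of training sites by an order of magnitude.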
-- Ian Holmes - 29 Sep 2006