Non-Coding RNA Datasets

From Biowiki

Datasets for Parameterizing/Training and Testing Non-Coding RNA In Silico Methods

N.B.: For now, the page is mostly about ncRNA classification/detection. Secondary structure prediction may follow.


One of the nice things about bioinformatics is that you can get a lot of information fast and cheap, without running expensive, time-consuming experiments. But how good is that information? How much confidence can you have in it? It would be desirable to perform thorough benchmarks of your methods on real (and even artificial, specially constructed) datasets to get both qualitative and quantitative measures of where the strong and weak points are. But before you can even test your computational model, surely you must train or parameterize it on high-confidence data.

The purpose of this page is to collect information on building, organizing, maintaining, and discussing datasets for such purposes, specifically for non-coding RNA secondary structure prediction and classification from sequence.


Keep in mind that I was reared at the free energy minimization school of ncRNA secondary structure prediction, so these suggestions may not carry over into probabilistic methods. But this is a wiki, so those who know should chime in. Anyone is free to edit this thing for any reason (except spambots).

The Basics

TP, FP, TN, FN = true positives, false positives, true negatives, false negatives

sensitivity = TP / (TP + FN)

specificity = 1 - false positive rate = TN / (TN + FP)

positive predictive value (PPV) = TP / (TP + FP)

negative predictive value (NPV) = TN / (TN + FN)

It would be nice to get at least an estimate of sensitivity, specificity, and PPV before embarking on either a de novo secondary structure prediction endeavor or a whole genome ncRNA screen. Generally, sensitivity and specificity trade off against each other: as one grows, the other diminishes. ROC curves plot sensitivity against the false positive rate (1 - specificity) across score thresholds to show this trade-off.
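To make the definitions above concrete, here is a minimal Python sketch that computes all four measures from confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration; they are not results from any real screen.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # A measure is undefined when its denominator is zero
        # (e.g. PPV when nothing at all was predicted positive).
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),  # TN / (TN + FP)
        "ppv": ratio(tp, tp + fp),          # TP / (TP + FP)
        "npv": ratio(tn, tn + fn),          # TN / (TN + FN)
    }


# Arbitrary illustrative counts, not real data.
metrics = classification_metrics(tp=90, fp=20, tn=180, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an undefined measure is one design choice among several; raising an error would be equally defensible.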

Whole genome ncRNA screens (i.e. de novo ncRNA predictions from sequence)

As the amount of real ncRNA in genomes is tiny compared to the non-ncRNA, maximizing specificity, even at the cost of sensitivity, is crucial - otherwise the number of false positives will be so high as to render the PPV unacceptably low. For a crude example, let's say that a 100 megabase genome (more likely, genome alignment) contains 1000 ncRNAs. We scan it without prefiltering (let's assume no windows are thrown out due to being aligned to gaps, etc.) using 200 base windows, no overlap. (Let's simplify even further by saying that all ncRNAs neatly fit in some window, even though that's rarely true.) Even if our sensitivity is 100% (so we pick up every real ncRNA) and our specificity is 99%, we will get:

  1e8 bases / 200-base window size = 5e5 windows
  5e5 windows - 1000 windows with real ncRNAs = 4.99e5 windows with non-ncRNA (negatives)
  4.99e5 negatives * 0.01 false positive rate = 4990 false positives
  1000 true positives / (1000 true positives + 4990 false positives) = 16.69% positive predictive value

So even with 99% specificity, only 16.69% of our positive hits are real positives. This makes experimental verification a chore, especially considering that expression of many ncRNAs can be transient, low copy number, condition-specific, cell cycle- or cell type-specific, development stage-specific, etc. Harvesting so many cell types from an organism under so many conditions is no small task, and the scale of such a project grows with the number of putative ncRNA candidates, so you want to keep the false positive fraction low. That said, microarray technology is becoming better and cheaper, so perhaps the significance of this will diminish (anyone who knows some hard numbers care to chime in on this?).
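The arithmetic of the crude example can be wrapped in a small Python function to see how PPV responds to specificity. This is only a sketch under the same simplifying assumptions (non-overlapping windows, every real ncRNA fits in exactly one window); the name `screen_ppv` and the extra specificity values tried in the loop are illustrative choices, not part of the original example.

```python
def screen_ppv(genome_bases, window_size, n_real, sensitivity, specificity):
    """PPV of a windowed genome screen under the crude assumptions above."""
    n_windows = genome_bases // window_size           # non-overlapping windows
    n_negative = n_windows - n_real                   # windows without a real ncRNA
    true_positives = n_real * sensitivity
    false_positives = n_negative * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# 100 megabase genome, 200 base windows, 1000 real ncRNAs, 100% sensitivity.
for specificity in (0.99, 0.999, 0.9999):
    ppv = screen_ppv(100_000_000, 200, 1000, 1.0, specificity)
    print(f"specificity {specificity:.2%} -> PPV {ppv:.2%}")
```

With specificity set to 0.99 the function reproduces the 1000 / (1000 + 4990) figure worked out above.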

Of course, we made a lot of assumptions here, but in a real screen:

  • the window number will be higher, as we need some window overlap to catch ncRNAs crossing window edges; or, it could be lower depending on your pre-filtering scheme;
  • the sensitivity will be lower (right now, 100% sensitivity to 99% specificity is something to dream about... in reality the sensitivity would be lower to achieve a specificity of such magnitude);
  • who knows how many real ncRNAs there really are for a genome that size and how it varies from species to species (I mean, after all, that's what we're trying to find);
  • many heuristics would be employed to whittle down the pool of candidates for verification, e.g. promoter (or other genomic context) analysis, so even if your method's PPV is low, the net PPV after the heuristics are applied may be higher;
  • etc.
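To get a feel for the first point, here is a small Python sketch that counts sliding windows for a given step size. The function name `count_windows` and the step values are hypothetical; the sketch ignores pre-filtering and alignment gaps entirely.

```python
def count_windows(seq_len, window_size, step):
    """Number of full-length windows when sliding by `step` bases."""
    if seq_len < window_size:
        return 0
    return (seq_len - window_size) // step + 1


genome_bases = 100_000_000
window_size = 200
# step == window_size means no overlap; smaller steps mean more overlap.
for step in (200, 100, 50):
    n = count_windows(genome_bases, window_size, step)
    print(f"step {step}: {n} windows")
```

Halving the step roughly doubles the number of windows scored, and with it the number of chances to produce a false positive, although overlapping windows are clearly not independent tests.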

Generating negatives for training/parameterization and testing

A robust set of negatives should contain three things:

  1. non-ncRNA elements that are found in real genomes, to ensure the model doesn't pick up on those;
  2. sequences that have all the same properties as known ncRNAs, except for properties upon which we base our classification - this is to ensure that our model is picking up on the properties we hypothesize separate ncRNA from non-ncRNA, and not some statistical bias that may exist in our set, but may not carry over into real genomes;
  3. completely and utterly random sequence, for some good measure.

This is an "and/or" by the way: the set should be some #1, some #2, and some #3. Inevitably there will be some overlap between the items, whether deliberate or, well... not deliberate.

Note that as ncRNA prediction from multiple sequence alignments has been shown to be more accurate than from single sequences (TODO: put in references), we need not a set of negative sequences, but negative sequence alignments. Or, if you're training a Sankoff-like algorithm that optimizes the alignments... (TODO: finish writing this)

Part #1 is easy - we just pull ORFs, transposable elements, repetitive elements (including microsatellites and telomeres, for good measure), gene flanking sequences (promoters and enhancers and such), DNA binding sites (am I missing anything?) from databases of such known elements. We should ignore 5' and 3' UTR regions of mRNA, of course, since those often form stable, significant secondary structures. Introns should be avoided also.

TODO: we need a list of databases/sources for #1 here.

The alignments should be easy... (TODO: finish writing)

Part #2 is a little more tricky, and depends on the model we're training. The first thing that springs to mind is that nucleotide and dinucleotide frequency distributions should be the same as for real ncRNA, except the nucleotides are in a different order - so the content is preserved, but not the structure. The easiest way to achieve this... (TODO: finish writing).
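For single sequences, one known way to shuffle while preserving exact dinucleotide counts is an Euler-path shuffle in the style of Altschul and Erickson. The Python sketch below is an assumption about what could be used here, not necessarily the approach this page had in mind, and it does not handle alignments. The names `dinucleotide_shuffle` and `dinucleotide_counts` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides in seq."""
    return Counter(zip(seq, seq[1:]))


def _reaches(vertex, target, final_edge):
    """Follow the chosen final edges from vertex; True if target is reached."""
    seen = set()
    while vertex != target:
        if vertex in seen:  # cycle that never reaches target
            return False
        seen.add(vertex)
        vertex = final_edge[vertex]
    return True


def dinucleotide_shuffle(seq, rng=None):
    """Return a random sequence with the same dinucleotide counts as seq."""
    rng = rng or random.Random()
    if len(seq) < 3:
        return seq
    # Edge lists: for each nucleotide, every nucleotide that follows it.
    successors = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    last = seq[-1]
    # Pick a final outgoing edge for every nucleotide except the last one,
    # retrying until those edges all lead to the last nucleotide.
    while True:
        final_edge = {v: rng.choice(s) for v, s in successors.items() if v != last}
        if all(_reaches(v, last, final_edge) for v in final_edge):
            break
    # Shuffle the remaining edges and put the chosen final edge at the end.
    order = {}
    for v, succ in successors.items():
        rest = list(succ)
        if v in final_edge:
            rest.remove(final_edge[v])
            rng.shuffle(rest)
            rest.append(final_edge[v])
        else:
            rng.shuffle(rest)
        order[v] = rest
    # Walk the edges in order, starting from the original first nucleotide.
    used = defaultdict(int)
    current = seq[0]
    out = [current]
    for _ in range(len(seq) - 1):
        nxt = order[current][used[current]]
        used[current] += 1
        out.append(nxt)
        current = nxt
    return "".join(out)


# Made-up toy sequence, not a real ncRNA.
original = "GGGAAACUUCGGUUUCCCAGCUAGCUAGGAUCC"
shuffled = dinucleotide_shuffle(original, random.Random(0))
print(original)
print(shuffled)
print(dinucleotide_counts(original) == dinucleotide_counts(shuffled))
```

Because dinucleotide counts and the first nucleotide are preserved, the mononucleotide counts and the last nucleotide are preserved as well. A plain `random.shuffle` of the nucleotides would keep only the mononucleotide counts.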

Part #3 is also easy... (TODO: finish writing)

Controlling for a genomic window in a screen

(TODO: write me!)

Normalizing data in your training set

(TODO: Is this valid? Does this apply? This is basically eliminating biases... not sure what to say here...)

(This should be for both reals and negatives.)

Simulating genome screens

One caveat from my own experience with E. coli and S. typhi, Yuri Bendana's experience with D. melanogaster, and the Eddy lab's experience with C. elegans and other worms (word on the street has it) is that the specificity measured on test sets tends to be overestimated. That is, your specificity in a real genomic screen will be much lower (presumably... we don't know how many positives are really false positives, but considering how many are predicted, surely most must be). So....

(TODO: write about burying real sequence in negative sequence.)

Is covariation enough?

(TODO: rant about my crazy theories that covariation analysis is not enough and that we need some quantitative data first, and then some additional criteria for the model.)

Where are these genomic false positives coming from?

Good question... and one of these days, we should do an analysis to see what is getting predicted as ncRNA so much. (TODO)


-- Created by: Andrew Uzilov - 13 Apr 2006