Orthologs for Yuri

Wed Sep 13, 2006 9:33 AM

Yuri wants to train a gene predictor grammar and asked for a set of fly gene predictions with the following properties:

  • all twelve species
  • 1:1 orthologs
  • single-exon and two-exon genes
  • +/- 1 kb of flanking sequence at each end

Working directory is:


I use the full_species set and filter it for all genes that contain one or two exons (that is, fewer than two introns) in dmel:

xpsql "SELECT 'dmel_vs_dmel4|' || prediction_id || '|' || gene_id || '|' || class FROM dmel_vs_dmel4.overview where nintrons < 2" > dmel_one_or_two_exons

This produces 6926 transcripts.

Of these, I find 1987 in the full species set:

python ~/t/ --apply=dmel_one_or_two_exons --column=1 < ../../orthology_malis/ > dmel.selected
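The selection step amounts to keeping the rows whose identifier column appears in a given set. A minimal sketch in Python, assuming tab-separated input, 1-based column numbering and "#" comment lines; filter_rows is a hypothetical helper for illustration, not the script used above, and the identifiers are made up:

```python
def filter_rows(lines, wanted, column=1):
    """Yield tab-separated lines whose `column` value is in `wanted`.

    `column` is 1-based (an assumption); lines starting with '#' are skipped.
    """
    for line in lines:
        if line.startswith("#"):
            continue  # comment line, not data
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fields[column - 1] in wanted:
            yield line


# Made-up identifiers in the 'schema|prediction_id|gene_id|class' layout.
wanted = {"dmel_vs_dmel4|1|g1|CG"}
table = [
    "# comment\n",
    "dmel_vs_dmel4|1|g1|CG\tcluster_7\n",
    "dmel_vs_dmel4|2|g2|CG\tcluster_8\n",
]
selected = list(filter_rows(table, wanted, column=1))
```

Holding the wanted identifiers in a set keeps each lookup constant-time, which matters little for a few thousand transcripts but scales to larger tables.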

Getting all the members and clusters:

python ~/t/ --apply=<(grep -v "#" dmel.selected | cut -f 2) --column=2 <  ../../orthology_malis/ >

Retrieving the full length sequences:

python ~/gpipe/ -f --multiple --genome-file=../../../predictions/%s/genome --extend-region=1000 --id-format=full > full.fasta
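The --extend-region=1000 option corresponds to the +/- 1 kb requested at each end. A minimal sketch of that coordinate arithmetic in Python, assuming 0-based half-open coordinates and clamping at the contig boundaries; extend_region is a hypothetical helper, and whether the script above clamps in the same way is not stated here:

```python
def extend_region(start, end, contig_length, extension=1000):
    """Widen [start, end) by `extension` bases on both sides.

    Coordinates are assumed 0-based and half-open; the result is clamped
    to [0, contig_length] so it never runs off the contig.
    """
    if not 0 <= start <= end <= contig_length:
        raise ValueError("region must lie within the contig")
    return max(0, start - extension), min(contig_length, end + extension)


# A gene well inside a contig gains 1 kb on each side.
inner = extend_region(5000, 6200, 100000)
# A gene near both contig ends is truncated at the boundaries.
edge = extend_region(300, 900, 1500)
```

Genes close to a contig end therefore come with less than 1 kb of flank on that side.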

Splitting all into clusters:

grep -v "#" full.fasta | perl ~/gpipe/ -m -a data/extended_%s.fasta > full.split

Getting my alignments:

for x in extended_*.fasta; do
    a=$(echo ${x} | perl -p -e "s/extended_//; s/_.*//")
    cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.raw_mali raw_${a}.fasta
    cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.bl_mali cleaned_${a}.fasta
done
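The perl substitutions in the loop strip the extended_ prefix and then everything from the next underscore onwards, leaving the cluster id. A minimal sketch of the same two substitutions in Python; cluster_id is a hypothetical helper and the file names are made up for illustration:

```python
import re


def cluster_id(filename):
    """Mirror the two perl substitutions: s/extended_//; s/_.*//"""
    name = re.sub(r"extended_", "", filename, count=1)  # drop the prefix
    return re.sub(r"_.*", "", name, count=1)  # drop from the next underscore on


with_suffix = cluster_id("extended_42_dmel.fasta")
without_underscore = cluster_id("extended_42.fasta")
```

The second call returns the name with .fasta still attached, so the loop only yields a clean id when the file name has a further underscore after the id.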

Finally, tarring everything up:

tar -cvzf yuri_malis.tgz yuri

