Andreas Heger LJOrthologs For Yuri

Orthologs for Yuri

Wed Sep 13, 2006 9:33 AM

Yuri wants to train a gene predictor grammar and asked for a set of fly gene predictions with the following properties:

  • all twelve species
  • 1:1 orthologs
  • single-exon and two-exon genes
  • +/- 1 kb of flanking sequence at each end

Working directory is:

/net/cpp-data/backup/andreas/projects/flies/release1v5/analysis/yuri

I use the full_species set and filter it for all genes that contain one or two exons (i.e. fewer than two introns) in dmel:

xpsql "SELECT 'dmel_vs_dmel4|' || prediction_id || '|' || gene_id || '|' || class FROM dmel_vs_dmel4.overview where nintrons < 2" > dmel_one_or_two_exons

This produces 6926 transcripts.
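
As a rough illustration of what the query selects, here is a minimal sketch against an in-memory SQLite table. The table layout and the three example rows are assumptions made up for the illustration; the real dmel_vs_dmel4.overview table lives in the project database and has more columns. The point is only that nintrons < 2 keeps transcripts with zero or one intron, i.e. one or two exons, and that the output is a single pipe-delimited identifier per transcript.

```python
import sqlite3

# Hypothetical stand-in for dmel_vs_dmel4.overview (only the columns the query uses).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE overview "
    "(prediction_id INTEGER, gene_id TEXT, class TEXT, nintrons INTEGER)"
)
conn.executemany(
    "INSERT INTO overview VALUES (?, ?, ?, ?)",
    [
        (1, "g1", "CG", 0),  # single exon: kept
        (2, "g2", "CG", 1),  # two exons: kept
        (3, "g3", "PG", 2),  # three exons: dropped
    ],
)


def one_or_two_exon_ids(conn):
    """Return 'dmel_vs_dmel4|prediction_id|gene_id|class' for transcripts with < 2 introns."""
    rows = conn.execute(
        "SELECT 'dmel_vs_dmel4|' || prediction_id || '|' || gene_id || '|' || class "
        "FROM overview WHERE nintrons < 2 ORDER BY prediction_id"
    )
    return [row[0] for row in rows]


for identifier in one_or_two_exon_ids(conn):
    print(identifier)
```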

Of these, I find 1987 in the full_species set:

python ~/t/filter_tokens.py --apply=dmel_one_or_two_exons --column=1 < ../../orthology_malis/full_species.map > dmel.selected
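
The filtering step can be pictured with the sketch below. This is not the code of filter_tokens.py; it only assumes what the command line suggests: the map file is tab-separated, the column index is 1-based, and comment lines starting with "#" are passed through (the later grep -v "#" hints at that). The function name and the example rows are hypothetical.

```python
def filter_by_tokens(lines, tokens, column):
    """Keep tab-separated lines whose 1-based `column` value is in `tokens`.

    Assumption: comment lines (starting with '#') are passed through unchanged.
    """
    wanted = set(tokens)
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= column and fields[column - 1] in wanted:
            kept.append(line)
    return kept


# Made-up example rows: identifier in column 1, cluster id in column 2.
example_map = [
    "# member\tcluster",
    "dmel_vs_dmel4|1|g1|CG\t10",
    "dmel_vs_dsim1|7|g7|CG\t10",
    "dmel_vs_dmel4|3|g3|PG\t11",
]
selected = filter_by_tokens(example_map, ["dmel_vs_dmel4|1|g1|CG"], column=1)
```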

Getting all members of the clusters that contain a selected dmel transcript:

python ~/t/filter_tokens.py --apply=<(grep -v "#" dmel.selected | cut -f 2) --column=2 <  ../../orthology_malis/full_species.map > full.map
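
The two filter passes together amount to "find the clusters hit by a selected dmel transcript, then pull every member of those clusters". A minimal sketch of that logic, assuming the map is a list of (member, cluster) pairs; the function name and the example data are hypothetical:

```python
def expand_to_clusters(map_rows, selected_members):
    """Return every (member, cluster) row whose cluster contains a selected member."""
    selected = set(selected_members)
    # Pass 1: clusters that contain at least one selected member.
    clusters = {cluster for member, cluster in map_rows if member in selected}
    # Pass 2: all members of those clusters, in original order.
    return [(member, cluster) for member, cluster in map_rows if cluster in clusters]


# Made-up example: cluster 10 holds a selected dmel transcript, cluster 11 does not.
example_rows = [
    ("dmel_vs_dmel4|1|g1|CG", "10"),
    ("dmel_vs_dsim1|7|g7|CG", "10"),
    ("dmel_vs_dmel4|3|g3|PG", "11"),
    ("dmel_vs_dsim1|9|g9|CG", "11"),
]
full_map = expand_to_clusters(example_rows, ["dmel_vs_dmel4|1|g1|CG"])
```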

Retrieving the full length sequences:

python ~/gpipe/extract_regions.py -f full.map --multiple --genome-file=../../../predictions/%s/genome --extend-region=1000 --id-format=full > full.fasta
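
The --extend-region=1000 option adds the requested 1 kb of flank on each side. A sketch of the coordinate arithmetic, assuming 0-based half-open coordinates and assuming that the extension is clamped at the contig boundaries (whether extract_regions.py clamps this way has not been checked here; the function below is hypothetical):

```python
def extend_region(start, end, contig_length, extension=1000):
    """Extend [start, end) by `extension` bases on both sides, clamped to the contig."""
    new_start = max(0, start - extension)
    new_end = min(contig_length, end + extension)
    return new_start, new_end


# A gene in the middle of a contig gains the full 1 kb on each side.
print(extend_region(5000, 6000, 100000))
# A gene near the contig ends gets less than 1 kb of flank.
print(extend_region(500, 2000, 2500))
```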

Splitting all into clusters:

grep -v "#" full.fasta | perl ~/gpipe/split_fasta.pl -m full.map -a data/extended_%s.fasta > full.split
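
The splitting step groups the extended sequences by cluster, one FASTA file per cluster. The sketch below shows the grouping in memory only. It is not split_fasta.pl; it assumes that the first word of each FASTA header is the member identifier used in full.map, and it silently skips sequences without a cluster. The function name and the example data are hypothetical.

```python
from collections import defaultdict


def split_fasta_by_cluster(fasta_lines, member_to_cluster):
    """Group FASTA records by cluster id; returns {cluster: [lines of its records]}."""
    clusters = defaultdict(list)
    current = None  # cluster of the record being read, None if unmapped
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            current = member_to_cluster.get(seq_id)
        if current is not None:
            clusters[current].append(line)
    return dict(clusters)


example_fasta = [
    "# comment",
    ">dmel_vs_dmel4|1|g1|CG",
    "ACGT",
    ">dmel_vs_dsim1|7|g7|CG",
    "ACGA",
    ">unmapped",
    "TTTT",
]
example_mapping = {"dmel_vs_dmel4|1|g1|CG": "10", "dmel_vs_dsim1|7|g7|CG": "10"}
by_cluster = split_fasta_by_cluster(example_fasta, example_mapping)
```

Each value could then be written to a file named after the data/extended_%s.fasta pattern, with the cluster id substituted for %s.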

Copying my raw (raw_mali) and cleaned (bl_mali) multiple alignments for each cluster, run from within the data directory:

for x in extended_*.fasta; do a=$(echo "${x}" | perl -p -e 's/extended_//; s/_.*//'); cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.raw_mali raw_${a}.fasta; cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.bl_mali cleaned_${a}.fasta; done

Finally, tarring everything up:

tar -cvzf yuri_malis.tgz yuri

-- TWiki Guest - 13 Sep 2006