Andreas Heger LJOrthologs For Yuri
From Biowiki
Orthologs for Yuri
Wed Sep 13, 2006 9:33 AM
Yuri wants to train a gene predictor grammar and asked for a set of fly gene predictions with the following properties:
- all twelve species
- 1:1 orthologs
- single exon and two exon genes
- +/- 1 kb of flanking sequence at each end
Working directory is:
/net/cpp-data/backup/andreas/projects/flies/release1v5/analysis/yuri
I start from the full_species set and restrict it to genes that have one or two exons in dmel. First, I list all dmel predictions with fewer than two introns:
xpsql "SELECT 'dmel_vs_dmel4|' || prediction_id || '|' || gene_id || '|' || class FROM dmel_vs_dmel4.overview where nintrons < 2" > dmel_one_or_two_exons
This produces 6926 transcripts.
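The query keeps predictions with fewer than two introns, which is the same as one or two exons, and builds a pipe-separated identifier from the schema name, prediction id, gene id and class. A minimal Python sketch of that logic follows. The rows, the class values and the function names are made up for illustration and are not part of the pipeline:

```python
def n_exons(nintrons):
    """A transcript with n introns has n + 1 exons."""
    return nintrons + 1


def make_identifier(schema, prediction_id, gene_id, quality_class):
    """Join the fields with '|', mirroring the SQL string concatenation."""
    return "|".join([schema, str(prediction_id), str(gene_id), quality_class])


# Hypothetical rows: (prediction_id, gene_id, class, nintrons)
rows = [
    (101, 11, "CG", 0),  # single exon
    (102, 12, "CG", 1),  # two exons
    (103, 13, "PG", 4),  # five exons, filtered out
]

selected = [
    make_identifier("dmel_vs_dmel4", pid, gid, cls)
    for pid, gid, cls, nintrons in rows
    if nintrons < 2
]
print(selected)
```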
Of these, I find 1987 in the full species set:
python ~/t/filter_tokens.py --apply=dmel_one_or_two_exons --column=1 < ../../orthology_malis/full_species.map > dmel.selected
Getting all members of the selected clusters:
python ~/t/filter_tokens.py --apply=<(grep -v "#" dmel.selected | cut -f 2) --column=2 < ../../orthology_malis/full_species.map > full.map
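The two filter steps amount to: select the map rows whose identifier is in the dmel list, collect their cluster ids, then pull every row belonging to those clusters. A rough Python sketch of the idea, assuming the map is a two-column table with the identifier in column 1 and the cluster id in column 2 (inferred from the --column and cut options above). filter_rows is a hypothetical stand-in, not the real filter_tokens.py:

```python
def filter_rows(rows, tokens, column):
    """Keep rows whose value in the 1-based `column` is in `tokens`."""
    return [row for row in rows if row[column - 1] in tokens]


# Hypothetical map rows: (identifier, cluster id)
full_map = [
    ("dmel_vs_dmel4|101|11|CG", "5"),
    ("dyak_vs_dmel4|900|90|CG", "5"),
    ("dmel_vs_dmel4|103|13|PG", "6"),
    ("dsim_vs_dmel4|700|70|CG", "6"),
]
dmel_tokens = {"dmel_vs_dmel4|101|11|CG"}

# Step 1: dmel entries that are in the token list (dmel.selected).
dmel_selected = filter_rows(full_map, dmel_tokens, column=1)

# Step 2: every member of the clusters those entries belong to (full.map).
cluster_ids = {row[1] for row in dmel_selected}
members = filter_rows(full_map, cluster_ids, column=2)
print(members)
```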
Retrieving the full-length sequences, extended by 1 kb at each end:
python ~/gpipe/extract_regions.py -f full.map --multiple --genome-file=../../../predictions/%s/genome --extend-region=1000 --id-format=full > full.fasta
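The 1 kb extension has to stop at the contig boundaries for genes that sit close to an end. How extract_regions.py handles this is not shown here; the sketch below is only one plausible way to do it, assuming 0-based, half-open coordinates and a hypothetical extend_region helper:

```python
def extend_region(start, end, contig_length, extension=1000):
    """Widen [start, end) by `extension` bases on both sides,
    clamped to the contig boundaries [0, contig_length)."""
    new_start = max(0, start - extension)
    new_end = min(contig_length, end + extension)
    return new_start, new_end


# A gene well inside the contig gains the full 1 kb on each side.
print(extend_region(5000, 6000, contig_length=100000))

# A gene near the contig ends is truncated at the boundaries.
print(extend_region(300, 900, contig_length=1500))
```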
Splitting the sequences into one file per cluster:
grep -v "#" full.fasta | perl ~/gpipe/split_fasta.pl -m full.map -a data/extended_%s.fasta > full.split
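For illustration, the split can be thought of as grouping FASTA records by the cluster id that the map assigns to each sequence identifier. The sketch below does this in memory. It assumes the first word of each FASTA header matches the identifier in the map, and it is not a port of split_fasta.pl:

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (identifier, sequence) pairs, skipping '#' comment lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_cluster(records, id_to_cluster):
    """Group records by cluster id; identifiers missing from the map are dropped."""
    clusters = defaultdict(list)
    for identifier, sequence in records:
        cluster = id_to_cluster.get(identifier)
        if cluster is not None:
            clusters[cluster].append((identifier, sequence))
    return dict(clusters)


# Hypothetical input
fasta = ["# comment", ">a desc", "ACGT", "AC", ">b", "GGG", ">c", "TTT"]
id_to_cluster = {"a": "5", "b": "5", "c": "6"}
by_cluster = split_by_cluster(parse_fasta(fasta), id_to_cluster)
print(by_cluster)
```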
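Getting my alignments, i.e. the raw and the cleaned multiple alignment for each cluster (judging by the relative paths, this is run from within the data directory):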
for x in extended_*.fasta; do
    a=$(echo ${x} | perl -p -e "s/extended_//; s/_.*//")
    cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.raw_mali raw_${a}.fasta
    cp ../../../orthology_malis/step1.dir/cluster_${a}.dir/cluster_${a}.bl_mali cleaned_${a}.fasta
done
Finally, tarring everything up:
tar -cvzf yuri_malis.tgz yuri
-- TWiki Guest - 13 Sep 2006