tranalign Wiki The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki. Please help by correcting and extending the Wiki pages. Function Generate an alignment of nucleic coding regions from aligned proteins Description tranalign is a re-implementation in EMBOSS of the program mrtrans by Bill Pearson. It reads a set of (unaligned) nucleotide sequences and a corresponding set of aligned protein sequences which are the translations, and writes the coding regions to file as a nucleotide sequence alignment. The sequences must be in the same order in the input sets. Each nucleotide sequence is translated in all three forward frames using the specified genetic code and the translations compared to the corresponding protein sequence from input the alignment. The contiguous nucleotide sequence that coded the protein is written to file (it will not splice together different exons to produce a coding sequence). Algorithm The protein sequences will typically include gap (-) characters. These are ignored during sequence comparison but replaced by --- in the nucleotide sequence alignment output. Usage Here is a sample session with tranalign % tranalign ../data/tranalign.pep tranalign2.seq Generate an alignment of nucleic coding regions from aligned proteins Go to the input files for this example Go to the output files for this example Command line arguments Generate an alignment of nucleic coding regions from aligned proteins Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [-asequence] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA) [-bsequence] seqset (Aligned) protein sequence set filename and optional format, or reference (input USA) [-outseq] seqoutset [.] (Aligned) nucleotide sequence set filename and optional format (output USA) Additional (Optional) qualifiers: -table menu [0] Code to use (Values: 0 (Standard); 1 (Standard (with alternative initiation codons)); 2 (Vertebrate Mitochondrial); 3 (Yeast Mitochondrial); 4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma); 5 (Invertebrate Mitochondrial); 6 (Ciliate Macronuclear and Dasycladacean); 9 (Echinoderm Mitochondrial); 10 (Euplotid Nuclear); 11 (Bacterial); 12 (Alternative Yeast Nuclear); 13 (Ascidian Mitochondrial); 14 (Flatworm Mitochondrial); 15 (Blepharisma Macronuclear); 16 (Chlorophycean Mitochondrial); 21 (Trematode Mitochondrial); 22 (Scenedesmus obliquus); 23 (Thraustochytrium Mitochondrial)) Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-asequence" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-bsequence" associated qualifiers -sbegin2 integer Start of each sequence to be used -send2 integer End of each sequence to be used -sreverse2 boolean Reverse (if DNA) -sask2 boolean Ask for begin/end/reverse -snucleotide2 boolean Sequence is nucleotide -sprotein2 boolean Sequence is protein -slower2 boolean Make lower case -supper2 boolean Make upper case -sformat2 string Input sequence format -sdbname2 string Database name -sid2 string Entryname -ufo2 string UFO features -fformat2 string Features format -fopenfile2 string Features file name "-outseq" associated qualifiers -osformat3 string Output seq format -osextension3 string File name extension -osname3 string Base file name -osdirectory3 string Output directory -osdbname3 string Database name to add -ossingle3 boolean Separate file for each entry -oufo3 string UFO features -offormat3 string Features format -ofname3 string Features file name -ofdirectory3 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit Input file format The input is a set of unaligned nucleic sequences and the set of aligned protein sequences to be used as a guide in the alignment of the output nucleic sequences. The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names. There must be at least as many protein sequences as nucleic acid sequence - extra protein sequences are ignored. Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order. Input files for usage example File: tranalign.seq >HSFAU1 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU2 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU3 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU4 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU5 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa File: tranalign.pep >HSFAU1_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU2_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDALWASAGWRP >HSFAU3_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU4_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU5_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS Output file format Output files for usage example File: tranalign2.seq >HSFAU1 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU2 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc--- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------------------------------------- >HSFAU3 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU4 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU5 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences. Data files None. Notes In general, it is better to use protein sequences for multiple alignment, but to use DNA sequences for phylogeny, for example, when using the programs dnadist, dnapars, dnaml, etc in the PHYLIP package. Where one has a protein sequence alignment, it would be time consuming to remove gap characters before back-translating the proteins. tranalign helps by generating aligned cDNA sequences from a protein sequence alignment. tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (maybe using extractseq or yank and union?). References None. Warnings The sequences must be in the same order in both input sets of sequences. Some alignment program (including clustalw/emma) will re-order their input sequences so as to group similar sequences together. Diagnostic Error Messages "No guide protein sequence available for nucleic sequence xxx" - the corresponding protein sequence for this nucleic sequence has not been input. You have input more nucleic acid sequences than protein sequences. "Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. The protein sequence might not be the corresponding one for this nucleic acid sequence if they are out of order. Exit status It always exits with status 0. Known bugs None. See also Program name Description edialign Local multiple alignment of sequences emma Multiple sequence alignment (ClustalW wrapper) infoalign Display basic information about a multiple sequence alignment plotcon Plot conservation of a sequence alignment prettyplot Draw a sequence alignment with pretty formatting showalign Display a multiple sequence alignment in pretty format Author(s) The original program mrtrans was written by Bill Pearson (wrp@virginia.edu) tranalign was written in EMBOSS code using the description of mrtrans as a guide by Gary Williams formerly at: MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK Please report all bugs to the EMBOSS bug team (emboss-bug (c) emboss.open-bio.org) not to the original author. History mrtrans written (Jan 1991, July 1987) - Bill Pearson tranalign written (March 2002) - Gary Williams Target users This program is intended to be used by everyone and everything, from naive users to embedded scripts. Comments None