backtranambig

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Back-translate a protein sequence to ambiguous nucleotide sequence

Description

backtranambig reads a protein sequence and writes the nucleic acid sequence it could have come from. It does this by using nucleotide ambiguity codes that represent all possible codons for each amino acid.

Algorithm

backtranambig needs a genetic code to generate an ambiguous codon for each amino acid. The default genetic code is the standard ("Universal") code, although many others are available via the '-table' qualifier. The codon usage tables correspdonding to these codes must exist in the EMBOSS data directory. See the section on "Data Files" below for more information.

Usage

Here is a sample session with backtranambig


% backtranambig 
Back-translate a protein sequence to ambiguous nucleotide sequence
Input (gapped) protein sequence(s): tsw:opsd_human
(gapped) nucleotide output sequence(s) [opsd_human.fasta]: 

Go to the input files for this example
Go to the output files for this example

Command line arguments

Back-translate a protein sequence to ambiguous nucleotide sequence
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) protein sequence(s) filename and
                                  optional format, or reference (input USA)
  [-outfile]           seqoutall  [.] (Aligned) nucleotide
                                  sequence set(s) filename and optional format
                                  (output USA)

   Additional (Optional) qualifiers:
   -table              menu       [0] Genetic code to use (Values: 0
                                  (Standard); 1 (Standard (with alternative
                                  initiation codons)); 2 (Vertebrate
                                  Mitochondrial); 3 (Yeast Mitochondrial); 4
                                  (Mold, Protozoan, Coelenterate Mitochondrial
                                  and Mycoplasma/Spiroplasma); 5
                                  (Invertebrate Mitochondrial); 6 (Ciliate
                                  Macronuclear and Dasycladacean); 9
                                  (Echinoderm Mitochondrial); 10 (Euplotid
                                  Nuclear); 11 (Bacterial); 12 (Alternative
                                  Yeast Nuclear); 13 (Ascidian Mitochondrial);
                                  14 (Flatworm Mitochondrial); 15
                                  (Blepharisma Macronuclear); 16
                                  (Chlorophycean Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus);
                                  23 (Thraustochytrium Mitochondrial))

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-sequence]
(Parameter 1)
seqall (Gapped) protein sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required
[-outfile]
(Parameter 2)
seqoutall (Aligned) nucleotide sequence set(s) filename and optional format (output USA) Writeable sequence(s) <*>.format
Additional (Optional) qualifiers
-table list Genetic code to use
0 (Standard)
1 (Standard (with alternative initiation codons))
2 (Vertebrate Mitochondrial)
3 (Yeast Mitochondrial)
4 (Mold, Protozoan, Coelenterate Mitochondrial and Mycoplasma/Spiroplasma)
5 (Invertebrate Mitochondrial)
6 (Ciliate Macronuclear and Dasycladacean)
9 (Echinoderm Mitochondrial)
10 (Euplotid Nuclear)
11 (Bacterial)
12 (Alternative Yeast Nuclear)
13 (Ascidian Mitochondrial)
14 (Flatworm Mitochondrial)
15 (Blepharisma Macronuclear)
16 (Chlorophycean Mitochondrial)
21 (Trematode Mitochondrial)
22 (Scenedesmus obliquus)
23 (Thraustochytrium Mitochondrial)
0
Advanced (Unprompted) qualifiers
(none)
Associated qualifiers
"-sequence" associated seqall qualifiers
-sbegin1
-sbegin_sequence
integer Start of each sequence to be used Any integer value 0
-send1
-send_sequence
integer End of each sequence to be used Any integer value 0
-sreverse1
-sreverse_sequence
boolean Reverse (if DNA) Boolean value Yes/No N
-sask1
-sask_sequence
boolean Ask for begin/end/reverse Boolean value Yes/No N
-snucleotide1
-snucleotide_sequence
boolean Sequence is nucleotide Boolean value Yes/No N
-sprotein1
-sprotein_sequence
boolean Sequence is protein Boolean value Yes/No N
-slower1
-slower_sequence
boolean Make lower case Boolean value Yes/No N
-supper1
-supper_sequence
boolean Make upper case Boolean value Yes/No N
-sformat1
-sformat_sequence
string Input sequence format Any string  
-sdbname1
-sdbname_sequence
string Database name Any string  
-sid1
-sid_sequence
string Entryname Any string  
-ufo1
-ufo_sequence
string UFO features Any string  
-fformat1
-fformat_sequence
string Features format Any string  
-fopenfile1
-fopenfile_sequence
string Features file name Any string  
"-outfile" associated seqoutall qualifiers
-osformat2
-osformat_outfile
string Output seq format Any string  
-osextension2
-osextension_outfile
string File name extension Any string  
-osname2
-osname_outfile
string Base file name Any string  
-osdirectory2
-osdirectory_outfile
string Output directory Any string  
-osdbname2
-osdbname_outfile
string Database name to add Any string  
-ossingle2
-ossingle_outfile
boolean Separate file for each entry Boolean value Yes/No N
-oufo2
-oufo_outfile
string UFO features Any string  
-offormat2
-offormat_outfile
string Features format Any string  
-ofname2
-ofname_outfile
string Features file name Any string  
-ofdirectory2
-ofdirectory_outfile
string Output directory Any string  
General qualifiers
-auto boolean Turn off prompts Boolean value Yes/No N
-stdout boolean Write first file to standard output Boolean value Yes/No N
-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N
-options boolean Prompt for standard and additional values Boolean value Yes/No N
-debug boolean Write debug output to program.dbg Boolean value Yes/No N
-verbose boolean Report some/full command line options Boolean value Yes/No Y
-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N
-warning boolean Report warnings Boolean value Yes/No Y
-error boolean Report errors Boolean value Yes/No Y
-fatal boolean Report fatal errors Boolean value Yes/No Y
-die boolean Report dying program messages Boolean value Yes/No Y
-version boolean Report version number and exit Boolean value Yes/No N

Input file format

backtranambig reads one or more protein sequences.

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl

Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.

The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Input files for usage example

'tsw:opsd_human' is a sequence entry in the example protein database 'tsw'

Database entry: tsw:opsd_human

ID   OPSD_HUMAN              Reviewed;         348 AA.
AC   P08100; Q16414; Q2M249;
DT   01-AUG-1988, integrated into UniProtKB/Swiss-Prot.
DT   01-AUG-1988, sequence version 1.
DT   15-JUN-2010, entry version 128.
DE   RecName: Full=Rhodopsin;
DE   AltName: Full=Opsin-2;
GN   Name=RHO; Synonyms=OPN2;
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RX   MEDLINE=84272729; PubMed=6589631; DOI=10.1073/pnas.81.15.4851;
RA   Nathans J., Hogness D.S.;
RT   "Isolation and nucleotide sequence of the gene encoding human
RT   rhodopsin.";
RL   Proc. Natl. Acad. Sci. U.S.A. 81:4851-4855(1984).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RA   Suwa M., Sato T., Okouchi I., Arita M., Futami K., Matsumoto S.,
RA   Tsutsumi S., Aburatani H., Asai K., Akiyama Y.;
RT   "Genome-wide discovery and analysis of human seven transmembrane helix
RT   receptor genes.";
RL   Submitted (JUL-2001) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RC   TISSUE=Retina;
RX   PubMed=17974005; DOI=10.1186/1471-2164-8-399;
RA   Bechtel S., Rosenfelder H., Duda A., Schmidt C.P., Ernst U.,
RA   Wellenreuther R., Mehrle A., Schuster C., Bahr A., Bloecker H.,
RA   Heubner D., Hoerlein A., Michel G., Wedler H., Koehrer K.,
RA   Ottenwaelder B., Poustka A., Wiemann S., Schupp I.;
RT   "The full-ORF clone resource of the German cDNA consortium.";
RL   BMC Genomics 8:399-399(2007).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA].
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA] OF 1-120.
RX   PubMed=8566799; DOI=10.1016/0378-1119(95)00688-5;
RA   Bennett J., Beller B., Sun D., Kariko K.;
RT   "Sequence analysis of the 5.34-kb 5' flanking region of the human
RT   rhodopsin-encoding gene.";


  [Part of this file has been deleted for brevity]

FT                                /FTId=VAR_004816.
FT   VARIANT     209    209       V -> M (effect not known).
FT                                /FTId=VAR_004817.
FT   VARIANT     211    211       H -> P (in RP4; dbSNP:rs28933993).
FT                                /FTId=VAR_004818.
FT   VARIANT     211    211       H -> R (in RP4).
FT                                /FTId=VAR_004819.
FT   VARIANT     216    216       M -> K (in RP4).
FT                                /FTId=VAR_004820.
FT   VARIANT     220    220       F -> C (in RP4).
FT                                /FTId=VAR_004821.
FT   VARIANT     222    222       C -> R (in RP4).
FT                                /FTId=VAR_004822.
FT   VARIANT     255    255       Missing (in RP4).
FT                                /FTId=VAR_004823.
FT   VARIANT     264    264       Missing (in RP4).
FT                                /FTId=VAR_004824.
FT   VARIANT     267    267       P -> L (in RP4).
FT                                /FTId=VAR_004825.
FT   VARIANT     267    267       P -> R (in RP4).
FT                                /FTId=VAR_004826.
FT   VARIANT     292    292       A -> E (in CSNBAD1).
FT                                /FTId=VAR_004827.
FT   VARIANT     296    296       K -> E (in RP4; dbSNP:rs29001653).
FT                                /FTId=VAR_004828.
FT   VARIANT     297    297       S -> R (in RP4).
FT                                /FTId=VAR_004829.
FT   VARIANT     342    342       T -> M (in RP4).
FT                                /FTId=VAR_004830.
FT   VARIANT     345    345       V -> L (in RP4).
FT                                /FTId=VAR_004831.
FT   VARIANT     345    345       V -> M (in RP4).
FT                                /FTId=VAR_004832.
FT   VARIANT     347    347       P -> A (in RP4).
FT                                /FTId=VAR_004833.
FT   VARIANT     347    347       P -> L (in RP4; common variant).
FT                                /FTId=VAR_004834.
FT   VARIANT     347    347       P -> Q (in RP4).
FT                                /FTId=VAR_004835.
FT   VARIANT     347    347       P -> R (in RP4; dbSNP:rs29001566).
FT                                /FTId=VAR_004836.
FT   VARIANT     347    347       P -> S (in RP4; dbSNP:rs29001637).
FT                                /FTId=VAR_004837.
SQ   SEQUENCE   348 AA;  38893 MW;  6F4F6FCBA34265B2 CRC64;
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
//

Output file format

The output is a nucleotide sequence containing the most favoured back translation of the specified protein, and using the specified codon usage table (which defaults to human).

The output is a standard EMBOSS sequence file.

The results can be output in one of several styles by using the command-line qualifier -osformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, dasgff, debug, listfile, dbmotif, diffseq, excel, feattable, motif, nametable, regions, seqtable, simple, srs, table, tagseq.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Output files for usage example

File: opsd_human.fasta

>OPSD_HUMAN P08100 Rhodopsin (Opsin-2)
ATGAAYGGNACNGARGGNCCNAAYTTYTAYGTNCCNTTYWSNAAYGCNACNGGNGTNGTN
MGNWSNCCNTTYGARTAYCCNCARTAYTAYYTNGCNGARCCNTGGCARTTYWSNATGYTN
GCNGCNTAYATGTTYYTNYTNATHGTNYTNGGNTTYCCNATHAAYTTYYTNACNYTNTAY
GTNACNGTNCARCAYAARAARYTNMGNACNCCNYTNAAYTAYATHYTNYTNAAYYTNGCN
GTNGCNGAYYTNTTYATGGTNYTNGGNGGNTTYACNWSNACNYTNTAYACNWSNYTNCAY
GGNTAYTTYGTNTTYGGNCCNACNGGNTGYAAYYTNGARGGNTTYTTYGCNACNYTNGGN
GGNGARATHGCNYTNTGGWSNYTNGTNGTNYTNGCNATHGARMGNTAYGTNGTNGTNTGY
AARCCNATGWSNAAYTTYMGNTTYGGNGARAAYCAYGCNATHATGGGNGTNGCNTTYACN
TGGGTNATGGCNYTNGCNTGYGCNGCNCCNCCNYTNGCNGGNTGGWSNMGNTAYATHCCN
GARGGNYTNCARTGYWSNTGYGGNATHGAYTAYTAYACNYTNAARCCNGARGTNAAYAAY
GARWSNTTYGTNATHTAYATGTTYGTNGTNCAYTTYACNATHCCNATGATHATHATHTTY
TTYTGYTAYGGNCARYTNGTNTTYACNGTNAARGARGCNGCNGCNCARCARCARGARWSN
GCNACNACNCARAARGCNGARAARGARGTNACNMGNATGGTNATHATHATGGTNATHGCN
TTYYTNATHTGYTGGGTNCCNTAYGCNWSNGTNGCNTTYTAYATHTTYACNCAYCARGGN
WSNAAYTTYGGNCCNATHTTYATGACNATHCCNGCNTTYTTYGCNAARWSNGCNGCNATH
TAYAAYCCNGTNATHTAYATHATGATGAAYAARCARTTYMGNAAYTGYATGYTNACNACN
ATHTGYTGYGGNAARAAYCCNYTNGGNGAYGAYGARGCNWSNGCNACNGTNWSNAARACN
GARACNWSNCARGTNGCNCCNGCN

Data files

The codon usage table is read by default from "Ehum.cut" in the 'data/CODONS' directory of the EMBOSS distribution. If the name of a codon usage file is specified on the command line, then this file will first be searched for in the current directory and then in the 'data/CODONS' directory of the EMBOSS distribution.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

Notes

The ambiguous nucleotide sequence generated by backtranambig can be translated to the original protein using transeq, which will recognise highly redundant codons (for example "WSN" for serine) as being produced by a program such as backtranambig.

References

None.

Warnings

None.

Diagnostic Error Messages

"Corrupt codon index file" - the codon usage file is incomplete or empty.

"The file 'drosoph.cut' does not exist" - the codon usage file cannot be opened.

Exit status

This program always exits with a status of 0, unless the codon usage table cannot be opened.

Known bugs

None.

See also

Program name Description
backtranseq Back-translate a protein sequence to a nucleotide sequence
checktrans Reports STOP codons and ORF statistics of a protein
coderet Extract CDS, mRNA and translations from feature tables
compseq Calculate the composition of unique words in sequences
emowse Search protein sequences by digest fragment molecular weight
freak Generate residue/base frequency table or plot
mwcontam Find weights common to multiple molecular weights files
mwfilter Filter noisy data from molecular weights file
oddcomp Identify proteins with specified sequence word composition
pepdigest Reports on protein proteolytic enzyme or reagent cleavage sites
pepinfo Plot amino acid properties of a protein sequence in parallel
pepstats Calculates statistics of protein properties
plotorf Plot potential open reading frames in a nucleotide sequence
prettyseq Write a nucleotide sequence and its translation to file
remap Display restriction enzyme binding sites in a nucleotide sequence
showorf Display a nucleotide sequence and translation in pretty format
showseq Displays sequences with features in pretty format
sixpack Display a DNA sequence with 6-frame translation and ORFs
transeq Translate nucleic acid sequences
wordcount Count and extract unique words in molecular sequence(s)

Author(s)

Alan Bleasby
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

History

None

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None