One good check is to note how long your sequence should be. If the program tells you it is the correct length, it has probably been read correctly.
|G||Guanine||K||G or T|
|A||Adenine||S||G or C|
|C||Cytosine||W||A or T|
|T||Thymine||H||A or C or T|
|U||Uracil||B||G or T or C|
|R||Purine (A or G)||V||G or C or A|
|Y||Pyrimidine (C or T)||D||G or T or A|
|M||A or C||N||G or A or T or C|
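The ambiguity codes in the table above can be represented as a simple lookup. The following is a minimal sketch, not the implementation used by any particular program; the names `IUPAC`, `matches` and `expand` are hypothetical, and treating U as equivalent to T is an assumption made for convenience.

```python
from itertools import product

# IUPAC nucleotide symbols transcribed from the table above.
# Assumption: U is treated as T so RNA and DNA symbols compare equal.
IUPAC = {
    "G": "G", "A": "A", "C": "C", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}

def matches(symbol, base):
    """Return True if the unambiguous base is permitted by the IUPAC symbol."""
    return base.upper() in IUPAC[symbol.upper()]

def expand(seq):
    """List every unambiguous sequence consistent with an ambiguous one."""
    choices = [IUPAC[s] for s in seq.upper()]
    return ["".join(p) for p in product(*choices)]

print(matches("R", "g"))  # R is a purine, so G is allowed
print(expand("AY"))       # Y stands for C or T
```

Note that `expand` grows exponentially with the number of ambiguous positions, so it is only practical for short sequences such as primers or restriction sites.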
Most programs allow either upper- or lowercase letters for nucleic acids. Also, most programs automatically convert U's to T's as sequences are read in. By convention (and common sense), sequences are always written 5'--->3'.
B. Amino acids
Typically, programs only allow 1-letter symbols as input. Also, most programs will only recognize uppercase symbols as amino acids.
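As an illustration of these conventions, the sketch below normalizes a nucleic acid sequence and checks a protein sequence. The function names are hypothetical, and the amino acid set is an assumption: it contains only the 20 standard one-letter codes, whereas individual programs may also accept symbols such as X, B or Z.

```python
# Assumption: only the 20 standard one-letter amino acid codes are legal.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def normalize_nucleic(seq):
    """Uppercase a nucleic acid sequence and convert U's to T's."""
    return seq.upper().replace("U", "T")

def is_protein(seq):
    """Return True only if every symbol is an uppercase amino acid code."""
    return len(seq) > 0 and all(ch in AMINO_ACIDS for ch in seq)

print(normalize_nucleic("augc"))
print(is_protein("MKV"), is_protein("mkv"))
```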
|.dna||free format, nucleic acid|
|.fsa, .wrp||nucleic acid or protein, Pearson/FASTA|
||GENBANK, nucleic acid|
|.pro||free format, protein|
|.raw||nucleotides or amino acids only|
||GDE, bioLegato flat files
Suggested file extensions for other types of files include:
|.txt, .asc||ASCII text|
|OpenOffice Writer, StarWriter
OpenOffice Calc, StarCalc
OpenOffice Impress, Star Impress
Although some errors in input file format will be caught, others may result in the interpretation of legal nucleotide or amino acid characters as being incorrectly added to the sequence. For example, if the user attempted to read in a GenBank file as a free format file, any legal nucleotide characters in the sequence name would be added to the sequence.
Note: Raw, Free-format, and Pearson/FASTA format files do not have a way to specify when sequences are circular! GenBank format allows you to specify 'Circular' on the LOCUS line.
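A program could recover the topology from a GenBank entry along the following lines. This is a rough sketch that assumes the topology appears as a separate whitespace-delimited word on the LOCUS line; the function name and the example LOCUS line are hypothetical.

```python
def is_circular(genbank_text):
    """Return True if the LOCUS line of a GenBank entry says 'circular'."""
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            # Compare case-insensitively against each word on the line.
            return "circular" in line.lower().split()
    return False  # no LOCUS line found, so nothing marks it circular

# Hypothetical LOCUS line, for illustration only.
entry = "LOCUS       EXAMPLE1     4361 bp    DNA     circular SYN 01-JAN-2000\n"
print(is_circular(entry))
```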
a. Comments are denoted by a semicolon (;). When a semicolon is encountered, the rest of the line is ignored by the programs.
b. Outside of comments, only legal sequence symbols are read as sequence. Other characters (e.g. blanks, numerals) are ignored.
c. DNA and RNA symbols may be in either upper or lowercase.
Sequences can be typed into a file in any arrangement that is convenient. Blank spaces between bases or amino acids are ignored, and a sequence may run over many lines. Thus, you may skip a space every five or ten bases to make proofreading of the file easier. Any letters that are not in the above lists of symbols are ignored. This makes it possible to intersperse numbers and other symbols with the actual nucleotides or amino acids to be read in. Comments can be included anywhere in the file to annotate the sequence. Their importance can be likened to comments you might write in your laboratory notebook. Without comments, data you entered a year ago may be meaningless. Below is a sample datafile:
; pEX-A 10/31/81
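The free-format rules described above (semicolon comments, ignored blanks and numerals, case-insensitive nucleotides) can be sketched as a small reader. This is an illustrative sketch for nucleic acids only, not the code of any actual program; `read_free_format` and the sample text are hypothetical.

```python
# Legal nucleic acid symbols, as listed in the symbol table.
LEGAL = set("GACTURYMKSWHBVDN")

def read_free_format(text, legal=LEGAL):
    """Read a free-format nucleic acid file and return the sequence."""
    symbols = []
    for line in text.splitlines():
        code = line.split(";", 1)[0]  # everything after ';' is a comment
        # Keep legal symbols; blanks, numerals and other characters are skipped.
        symbols.extend(ch for ch in code.upper() if ch in legal)
    return "".join(symbols).replace("U", "T")  # convert U's to T's

# Hypothetical file contents, for illustration only.
sample = """; example plasmid fragment
1 ggatc caatt 10
11 ugca ; trailing comment
"""
seq = read_free_format(sample)
print(seq, len(seq))  # GGATCCAATTTGCA 14
```

Comparing the reported length with the length you expect is a quick check that the file was read correctly. The same sketch also shows why reading another format as free format is risky: any legal symbol outside a comment is silently added to the sequence.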
>name - definition
Example of a FASTA format file containing 2 sequences:
>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H
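A file in this format can be split into records by treating each line beginning with `>` as the start of a new sequence. The sketch below assumes the `>name - definition` header layout shown above; `read_fasta` and the two example records are hypothetical and do not correspond to real database entries.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, definition, sequence) tuples."""
    records = []
    name, definition, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:  # finish the previous record
                records.append((name, definition, "".join(chunks)))
            # Split the header at the first ' - ' into name and definition.
            name, _, definition = line[1:].partition(" - ")
            name, definition = name.strip(), definition.strip()
            chunks = []
        elif name is not None:
            chunks.append(line.replace(" ", ""))
    if name is not None:  # finish the last record
        records.append((name, definition, "".join(chunks)))
    return records

# Hypothetical two-sequence file, for illustration only.
sample = """>seq1 - first example
ACGT
ACGT
>seq2 - second example
GGCC
"""
for name, definition, sequence in read_fasta(sample):
    print(name, definition, len(sequence))
```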
The format for the current GenBank release can be found in the
GenBank Flat File Release notes at [
A formal definition of the GenBank Features Table can be found in The DDBJ/EMBL/GenBank Feature Table: Definition [http://www.ncbi.nlm.nih.gov/collab/FT/].
Click here for a sample SwissProt file (P27518)
The UniProt database unifies protein resources from several databases, including SwissProt and PIR.