SEQUENCE FILE FORMATS


A Word of Caution:

Whenever you use a program for the first time, check the documentation to make sure you know what format it expects the sequence to be in. An improperly formatted file can cause an incorrect sequence to be read in, making your results meaningless! For example, some programs read every character in the file and throw away anything other than A, G, C, T or U. This type of program expects a raw sequence file. However, if you used an email message containing a sequence as input, all the A's, G's, C's, T's and U's in the message text would be read in along with the actual sequence.

One good check is to note how long your sequence should be. If the program tells you it is the correct length, it has probably been read correctly.
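The problem and the length check can be illustrated with a minimal Python sketch. The function names read_raw and check_length are made up for this illustration and do not belong to any particular program; the reader simply mimics the "keep only A, G, C, T or U" behaviour described above.

```python
# Hypothetical raw-sequence reader: keeps only A, G, C, T or U (either case)
# and silently discards every other character.
NUCLEOTIDES = set("AGCTU")

def read_raw(text):
    """Return only the characters a raw-sequence reader would keep."""
    return "".join(ch for ch in text if ch.upper() in NUCLEOTIDES)

def check_length(sequence, expected_length):
    """Return True if the sequence that was read in has the expected length."""
    return len(sequence) == expected_length

# An "email message" with a 7-base sequence in it.
message = "Hi, the sequence is:\nGATTACA\n"
seq = read_raw(message)

# The letters t, u and c from the message text are read in as well,
# so the length check fails and warns us that something went wrong.
print(seq, len(seq), check_length(seq, 7))
```

Here the reader returns 10 characters instead of the expected 7, which is exactly the kind of discrepancy the length check is meant to catch.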

A. Nucleic acid sequences

The IUPAC-IUB symbols for nucleotide nomenclature [Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030.] are shown below:

Symbol	Meaning	Symbol	Meaning
G	Guanine	K	G or T
A	Adenine	S	G or C
C	Cytosine	W	A or T
T	Thymine	H	A or C or T
U	Uracil	B	G or T or C
R	Purine (A or G)	V	G or C or A
Y	Pyrimidine (C or T)	D	G or T or A
M	A or C	N	G or A or T or C
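As a rough sketch of how a program might use this table, the symbols can be stored in a Python dictionary that maps each symbol to the bases it stands for. The names IUPAC and matches are assumptions made for this example only.

```python
# The table above as a lookup: symbol -> bases the symbol can represent.
IUPAC = {
    "G": "G", "A": "A", "C": "C", "T": "T", "U": "U",
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GTA",
    "N": "GATC",
}

def matches(symbol, base):
    """True if the base is one of those allowed by the IUPAC symbol.

    Both arguments may be upper- or lowercase.
    """
    return base.upper() in IUPAC[symbol.upper()]

print(matches("R", "a"))   # purine includes A
print(matches("Y", "G"))   # pyrimidine does not include G
```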

Most programs allow either upper- or lowercase letters for nucleic acids. Also, most programs automatically convert U's to T's as sequences are read in. By convention (and common sense), sequences are always written 5'--->3'.

B. Amino acids

Typically, programs only allow 1-letter symbols as input. Also, most programs will only recognize uppercase symbols as amino acids.

3-letter	1-letter	3-letter	1-letter	3-letter	1-letter
Phe	F	Leu	L	Ile	I
Met	M	Val	V	Ser	S
Pro	P	Thr	T	Ala	A
Tyr	Y	His	H	Gln	Q
Asn	N	Lys	K	Asp	D
Glu	E	Cys	C	Trp	W
Arg	R	Gly	G	STOP	*
Asx	B	Glx	Z	UNKNOWN	X
Xle (Leu/Ile)	J	Pyl (pyrrolysine)	O
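Since most programs expect 1-letter symbols, a sequence written with 3-letter symbols has to be converted first. The following Python sketch shows one way to do that using the table above; THREE_TO_ONE and to_one_letter are illustrative names, and the STOP and UNKNOWN entries are left out because they are labels rather than 3-letter codes.

```python
# 3-letter -> 1-letter symbols, taken from the table above.
THREE_TO_ONE = {
    "Phe": "F", "Leu": "L", "Ile": "I", "Met": "M", "Val": "V", "Ser": "S",
    "Pro": "P", "Thr": "T", "Ala": "A", "Tyr": "Y", "His": "H", "Gln": "Q",
    "Asn": "N", "Lys": "K", "Asp": "D", "Glu": "E", "Cys": "C", "Trp": "W",
    "Arg": "R", "Gly": "G", "Asx": "B", "Glx": "Z", "Xle": "J", "Pyl": "O",
}

def to_one_letter(residues):
    """Convert a list of 3-letter symbols to an uppercase 1-letter string.

    Input case is normalized (e.g. "TRP" or "trp" -> "Trp").
    An unrecognized symbol raises KeyError rather than being guessed.
    """
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)

print(to_one_letter(["Met", "Ala", "TRP"]))
```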

C. File formats

Programs that read sequences will prompt the user with a choice of file formats. It is up to the user to know which format the file or files are in before starting the program. The following file extensions are used by many of the programs:

file extension	format
.dna	free format, nucleic acid
.fsa, .wrp	nucleic acid or protein, Pearson/FASTA
.gen, .gb	GENBANK, nucleic acid
.pro	free format, protein
.nbrf	NBRF/PIR, protein
.raw	nucleotides or amino acids only
.phy	Phylip
.gde	GDE, bioLegato
.flat	GDE, bioLegato flat files
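If you follow these naming conventions, the extension can serve as a first hint about the format of a file. The Python sketch below is only an illustration (EXTENSION_FORMATS and guess_format are hypothetical names), and an extension is never proof of what a file actually contains.

```python
import os

# Extensions from the table above mapped to the format they suggest.
EXTENSION_FORMATS = {
    ".dna": "free format, nucleic acid",
    ".fsa": "Pearson/FASTA",
    ".wrp": "Pearson/FASTA",
    ".gen": "GenBank",
    ".gb": "GenBank",
    ".pro": "free format, protein",
    ".nbrf": "NBRF/PIR",
    ".raw": "raw",
    ".phy": "Phylip",
    ".gde": "GDE",
    ".flat": "GDE flat file",
}

def guess_format(filename):
    """Return the format suggested by the file extension, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_FORMATS.get(extension)

print(guess_format("sample.gb"))
print(guess_format("notes.txt"))
```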

Suggested file extensions for other types of files include:

file extension	format
.txt, .asc	ASCII text
.ps	PostScript
.pdf	Adobe PDF
.odw, .sxw	OpenOffice Writer, StarWriter
.ods, .sxc	OpenOffice Calc, StarCalc
.odp	OpenOffice Impress, StarImpress
.fig	Xfig
.obj	TGIF
.png	PNG
.gif	GIF
.jpg	JPEG
.tif	TIFF

Although some errors in input file format will be caught, others may result in legal nucleotide or amino acid characters being incorrectly added to the sequence. For example, if the user attempted to read in a GenBank file as a free-format file, any legal nucleotide characters in the sequence name would be added to the sequence.

Note: Raw, free-format, and Pearson/FASTA format files do not have a way to specify when sequences are circular! GenBank format allows you to specify 'Circular' on the LOCUS line.
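As a small illustration of the note above, the following Python sketch looks for the word "circular" on the LOCUS line of a GenBank record. The function name is_circular and both LOCUS lines are made up for this example; real LOCUS lines should be checked against the GenBank release notes.

```python
def is_circular(genbank_text):
    """Return True if the first LOCUS line contains the word 'circular'.

    The comparison ignores case. Returns False if there is no LOCUS line.
    """
    for line in genbank_text.splitlines():
        if line.startswith("LOCUS"):
            return "circular" in line.lower().split()
    return False

# Invented LOCUS lines, for illustration only.
plasmid = "LOCUS       EXAMPLE1     4361 bp    DNA     circular SYN 01-JAN-2000\n"
fragment = "LOCUS       EXAMPLE2      495 bp    mRNA    linear   PLN 01-JAN-2000\n"

print(is_circular(plasmid))
print(is_circular(fragment))
```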

1. Raw (unformatted) - contains nothing but nucleotides or amino acids

Example:

guaagcuaugaagggagggaugauguuuauggugaauuggaacccagagg
cuuaacaggugacucguuaagaaaacuaccaugcuauaucaugucaagug
agaugaccaagaagcaaaucauucacugcacuauuugucuucaggacauu
gcaguaggcgaaaucacacgaaguuuaccgagaugugaccauacguuuca
ccugguuuguguugauaaauggcucaucagacauggaucaugccccauuu
gcagacaggccguuaaagauuaaaaagcaccauugguguccgaggagugu
acguagcaaaaauccauuguccuuauauguuguuguaagucucugaaucc
uuguuuuagucucucuuuguuacuuuuacuuauagcaucauccauagguu
ucuacuuuugaauguauacuauuguagacaugaauaauancaccuacagu
uauguuggagaaaaaaauauagaacucagauuaaguuaugcacug

2. Free format, nucleic acids or proteins, used by FSAP programs

There are only three rules to making a free-format file:

a. Comments are denoted by a semicolon (;). When a semicolon is encountered, the rest of the line is ignored by the programs.
b. Outside of comments, only legal sequence symbols are read as sequence. Other characters (e.g. blanks, numerals) are ignored.
c. DNA and RNA symbols may be in either upper or lowercase.

Sequences can be typed into a file in any arrangement that is convenient. Blank spaces between bases or amino acids are ignored, and a sequence may run over many lines. Thus, you may skip a space every five or ten bases to make proofreading of the file easier. Any characters that are not in the above lists of symbols are ignored. This makes it possible to intersperse numbers and other symbols with the actual nucleotides or amino acids to be read in. Comments can be included anywhere in the file to annotate the sequence. Their importance can be likened to comments you might write in your laboratory notebook. Without comments, data you entered a year ago may be meaningless. Below is a sample data file:

     ; pEX-A  10/31/81
     ; Sequence from experiment #185, starts at EcoRI site.
              10          20    30          40         50
     AATTC CGGTT CCTTA TTAAC AAATT CCCTT CCCTT CCCCC GGTTA
              60    
     CCACA GAATT GATTC ;Hinf I site
     ; Experiment #186
     cccta ggcca aattg gaTTC CNTTA NNCCC GGGAC TTACA GACTA
     CCTAG GACCG TTCGG TTACT ACTTC TCAGA AGACT GACTA CGGCT
     AAAAA AAA
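The three rules can be sketched in a few lines of Python. This is an illustration of the rules as stated above for nucleic acids, not the code used by the FSAP programs; the names DNA_SYMBOLS and read_free_format are assumptions.

```python
# Legal nucleic acid symbols, from the IUPAC-IUB table.
DNA_SYMBOLS = set("GACTURYMKSWHBVDN")

def read_free_format(text, legal=DNA_SYMBOLS):
    """Read a free-format file and return the sequence as one string."""
    sequence = []
    for line in text.splitlines():
        # Rule a: everything after a semicolon is a comment.
        line = line.split(";", 1)[0]
        # Rules b and c: keep only legal symbols, in either case;
        # blanks, numerals and other characters are skipped.
        sequence.extend(ch for ch in line if ch.upper() in legal)
    return "".join(sequence)

sample = "\n".join([
    "; pEX-A  10/31/81",
    "         10",
    "AATTC CGGTT ;Hinf I site",
    "cccta NN",
])
print(read_free_format(sample))
```

Note that a sketch like this would also read legal symbols out of any text that is not marked as a comment, which is why reading a file in the wrong format can silently corrupt the sequence.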

3. Pearson/FASTA - most common format other than GenBank. The vast majority of programs can read this format.

The format is

>name - definition
sequence line
sequence line
sequence line
etc...

Example of a FASTA format file containing 2 sequences:

>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H
guaagcuaugaagggagggaugauguuuauggugaauuggaacccagagg
cuuaacaggugacucguuaagaaaacuaccaugcuauaucaugucaagug
agaugaccaagaagcaaaucauucacugcacuauuugucuucaggacauu
gcaguaggcgaaaucacacgaaguuuaccgagaugugaccauacguuuca
ccugguuuguguugauaaauggcucaucagacauggaucaugccccauuu
gcagacaggccguuaaagauuaaaaagcaccauugguguccgaggagugu
acguagcaaaaauccauuguccuuauauguuguuguaagucucugaaucc
uuguuuuagucucucuuuguuacuuuuacuuauagcaucauccauagguu
ucuacuuuugaauguauacuauuguagacaugaauaauancaccuacagu
uauguuggagaaaaaaauauagaacucagauuaaguuaugcacug
>AI352968 - pMB75-7B 5' similar to chloroplast 23S rRNA|Chloroplast Oryza sativa|X15901
GGCTTACGGTGGATACCTAGGCACCCAGAGACGAGGAAGGGCGTAGTAAGCGACGAAATGCTTCGGGGAG
TTGAAAATAAGCGTAGATCCGGAGATTCCCGAATAGGTTAACCTTTTGAACTGCTGCTGAATCCATGGGC
AGGCAAGAGACAACCTGGCGAACTGAAACATCTTAGTAGCCAGAGGAAAAGAAAGCAAAAGCGATTCCCG
TAGTAGCGGCGAGCGAAATGGGAGCAGCCTAAACCGTGAAAACGGGGTTGTGGGAGAGCAATAAAAGCGT
CGTGCTGCTAGGCGAAGCGGTGGGAGTGCCGCACCCTAGATGGCGAGAGTCCAGTAGCCGAAAGCATCAC
TAGCTTATGCTCTGACCCCGAGTAGCATGGGGNCACGTGGATCCCCGTGTGAATCAGCAAGGACCACCTT
GCAAGGCTAAATACTCCTTGGGTGACCGATGCCGAAAGTAGTTACCCGTAGGGAAAGGGTTGAAAAAGAA
CCCCATGGNGGGAGTTGAAATAGAACATGANAACCGTAAGCTCCAAAGCAGTGGGGAGGAACCTGGGCTT
GNCCGCGTGNCCTGTTGAAGAATGAACCGGCGACTT

This example illustrates the flexibility of the FASTA format, allowing DNA, RNA or protein sequences, with no limit on line length or the number of sequences in the file.
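A minimal reader for the layout shown above might look like the following Python sketch. It assumes the ">name - definition" header convention used in this document (other sources of FASTA files may lay out the header differently), and the function names are hypothetical.

```python
def _finish(header, chunks):
    """Split a header into name and definition and join the sequence lines."""
    name, _, definition = header.partition(" - ")
    return name.strip(), definition.strip(), "".join(chunks)

def read_fasta(text):
    """Return a list of (name, definition, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence lines of any length
    if header is not None:                # close the last record
        records.append(_finish(header, chunks))
    return records

example = ">seq1 - first\nACGT\nacgt\n>seq2 - second\nGGCC\n"
for name, definition, sequence in read_fasta(example):
    print(name, definition, len(sequence))
```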

4. GENBANK


The format for the current GenBank release can be found in the GenBank Flat File Release notes at [https://ftp.ncbi.nih.gov/genbank/gbrel.txt].
A formal definition of the GenBank Features Table can be found in The DDBJ/EMBL/GenBank Feature Table: Definition [http://www.ncbi.nlm.nih.gov/collab/FT/].

5. UniProt/SwissProt

Sample SwissProt file: P27518

The UniProt database unifies protein resources from several databases, including SwissProt and PIR.