ChangeLog - FASTA v35
$Id: changes_v35.html 120 2010-01-31 19:42:09Z wrp $
$Revision: 210 $
Summary - Major Changes in FASTA version 35 (August, 2007)
- Accurate shuffle based statistics for searches of small libraries (or pairwise comparisons).
- Inclusion of lalign35 (SIM) into FASTA3. Accurate statistics for
lalign35 alignments. plalign has been replaced by
lalign35 and lav2ps.
- Two new global alignment programs: ggsearch35 and glsearch35.
February 7, 2008
Allow annotations in library, as well as query, sequences. Currently,
annotations are only available within sequences (i.e., they are not
read from the feature table), but they should be available in FASTA
format, or any of the other ascii text formats (EMBL/Swissprot,
Genbank, PIR/GCG). If annotations are present in a library and the
annotation characters include '*', then the -V '*' option MUST be
used. However, special characters other than '*' are ignored, so
annotations of '@' or '%' should be transparent.
In translated sequence comparisons, annotations are only available for
the protein sequence.
January 25, 2008
Support protein queries and sequence
libraries that contain 'O' (pyrrolysine) and 'U' (selenocysteine).
('J' was supported already). Currently, 'O' is mapped automatically to
'K' and 'U' to 'C'.
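As a rough illustration of the mapping described above (this is not the
FASTA source code), the substitution can be sketched in Python. The
function name map_rare_residues is hypothetical, and handling only
upper-case residues is an assumption.

```python
# Hypothetical sketch of the residue mapping described in this entry:
# 'O' (pyrrolysine) is treated as 'K' and 'U' (selenocysteine) as 'C'.
# Assumption: only upper-case residue codes are mapped.
RARE_RESIDUE_MAP = str.maketrans({"O": "K", "U": "C"})


def map_rare_residues(seq: str) -> str:
    """Return seq with 'O' replaced by 'K' and 'U' replaced by 'C'."""
    return seq.translate(RARE_RESIDUE_MAP)


if __name__ == "__main__":
    print(map_rare_residues("MAOUKC"))
```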
Dec. 13, 2007 CVS fa35_03_02m
Add ability to search a subset of a library using a file name and a
list of accession/gi numbers. This version introduces a new filetype,
10, which consists of a first line with a target filename, format, and
accession number format-type, and optionally the accession number
format in the database, followed by a list of accession numbers. For
example:
</slib2/blast/swissprot.lseg 0:2 4|
3121763
51701705
7404340
74735515
...
The first line tells the program that the target database is
swissprot.lseg, which is in FASTA (library type 0) format.
The accession format comes after the ":". Currently, there are four
accession formats, two that require ordered accessions (:1, :2), and
two that hash the accessions (:3, :4) so they do not need to be
ordered. The number and character after the accession format
(e.g. "4|") indicate the offset of the beginning of the accession and
the character that terminates the accession. Thus, in the typical
NCBI Fasta definition line:
>gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
The offset is 4 and the termination character is '|'. For databases
distributed in FASTA format from the European Bioinformatics
Institute, the offset depends on the name of the database, e.g.
>SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
and the delimiter is ' ' (space, the default).
Accession formats 1 and 3 expect strings; accession formats 2 and 4
work with integers (e.g. gi numbers).
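The header line and the offset/terminator rule described above can be
sketched as follows. This is an illustrative Python reading of the
format, not the parser used by FASTA. The names parse_subset_header,
extract_accession, and SubsetHeader are hypothetical. Stripping the
leading '<' and counting the offset from the '>' of the definition line
are assumptions drawn from the examples in this entry.

```python
import re
from typing import NamedTuple, Optional


class SubsetHeader(NamedTuple):
    path: str               # target library file
    lib_type: int           # library format, e.g. 0 for FASTA
    acc_format: int         # accession format, 1-4
    offset: Optional[int]   # where the accession starts, if given
    terminator: str         # character that ends the accession


def parse_subset_header(line: str) -> SubsetHeader:
    """Parse a first line such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    fields = line.strip().lstrip("<").split()
    if len(fields) < 2:
        raise ValueError("expected '<file> <libtype>:<accformat> [offset+char]'")
    lib_type, acc_format = (int(x) for x in fields[1].split(":"))
    offset, terminator = None, " "   # space is the default delimiter
    if len(fields) > 2:
        m = re.fullmatch(r"(\d+)(.?)", fields[2])
        if m is None:
            raise ValueError("bad accession location: " + fields[2])
        offset = int(m.group(1))
        terminator = m.group(2) or " "
    return SubsetHeader(fields[0], lib_type, acc_format, offset, terminator)


def extract_accession(defline: str, offset: int, terminator: str = " ") -> str:
    """Return the text from offset up to (not including) the terminator."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end == -1 else defline[offset:end]


if __name__ == "__main__":
    hdr = parse_subset_header("</slib2/blast/swissprot.lseg 0:2 4|")
    ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase"
    print(hdr, extract_accession(ncbi, hdr.offset, hdr.terminator))
```

For the EBI-style line shown above, an offset of 4 with the default
space delimiter would select "104K_THEAN"; that offset is specific to
the "SW:" prefix in this example.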
December 10, 2007
Provide encoded annotation information with
-m 9c alignment summaries. The encoded alignment information makes it
much simpler to highlight changes in critical residues.
August 22, 2007
A new program is available, lav2svg, which creates SVG (Scalable Vector
Graphics) output. In addition, ps_lav, which was introduced May 30, 2007,
has been replaced by lav2ps. SVG files are more easily edited with Adobe
Illustrator than PostScript (lav2ps) files.
July 25, 2007 CVS fa35_02_02
Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
July 23, 2007
Add code to support sub-sequence ranges for "library"
sequences - necessary for fully functional prss (ssearch35) and
lalign35. For all programs, it is now possible to specify a subset of
both the query and the library, e.g.
lalign35 -q mchu.aa:1-74 mchu.aa:75-148
Note, however, that the subset range applied to the library will be
applied to every sequence in the library - not just the first - and
that the same subset range is applied to each sequence. This probably
makes sense only if the library contains a single sequence (this is
also true for the query sequence file).
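The file:start-stop notation used in the lalign35 example can be
illustrated with a short Python sketch. This is not the FASTA
implementation. The names parse_subrange and apply_subrange are
hypothetical, and the reading of the range as 1-based and inclusive is
an assumption based on the 1-74 / 75-148 example above.

```python
import re
from typing import Optional, Tuple

Range = Optional[Tuple[int, int]]


def parse_subrange(spec: str) -> Tuple[str, Range]:
    """Split 'mchu.aa:1-74' into ('mchu.aa', (1, 74)).

    A spec without a trailing ':start-stop' is returned with range None.
    """
    m = re.fullmatch(r"(.+):(\d+)-(\d+)", spec)
    if m is None:
        return spec, None
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def apply_subrange(seq: str, rng: Range) -> str:
    """Apply a range to one sequence (assumed 1-based, inclusive)."""
    if rng is None:
        return seq
    start, stop = rng
    return seq[start - 1:stop]


if __name__ == "__main__":
    name, rng = parse_subrange("mchu.aa:75-148")
    # The same range is applied to every sequence read from the file.
    for seq in ["A" * 148, "C" * 200]:
        print(name, len(apply_subrange(seq, rng)))
```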
July 3, 2007 CVS fa35_02_01
Merge of previous fasta34 with development version fasta35.
June 26, 2007
Add amino-acid 'J' for 'I' or 'L'.
Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
June 7, 2007
ggsearch35(_t) and glsearch35(_t) can now use PSSMs.
May 30, 2007 CVS fa35_01_04
Addition of ps_lav (now lav2ps or lav2svg), which can be used to plot
the lav output of lalign35 -m 11.
lalign35 -m 11 | lav2ps
replaces plalign (from FASTA2).
May 2, 2007
The labels on the alignment scores are much more informative (and more
diverse). In the past, alignment scores looked like:
>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^
where the highlighted text was either: "Smith-Waterman" or "banded
Smith-Waterman". In fact, scores were calculated in other ways,
including global/local for fasts and fastf. With the addition of
ggsearch35, glsearch35, and lalign35, there are many more ways to
calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)". The last option is a global/global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.
April 19, 2007 CVS fa34t27br_lal_3
Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35(_t) calculates an alignment
that is global in the query and local in the library sequence. The
latter program is designed for global alignments to domains.
Both programs assume that scores are normally distributed. This
appears to be an excellent approximation for ggsearch35 scores, but
the distribution is somewhat skewed for global/local (glsearch)
scores.
ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
limits comparisons to library sequences that are longer than 80% of
the query. Initial results suggest that there is relatively little
length dependence of scores over this range (scores go down
dramatically outside these ranges).
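The length limits described above can be written out as a small Python
sketch. This only restates the 80% and 125% thresholds from this entry
and is not taken from the FASTA source. The function names are
hypothetical, and treating the ggsearch bounds as inclusive is an
assumption. Integer arithmetic is used so that the boundary cases do
not depend on floating-point rounding.

```python
def in_ggsearch_window(query_len: int, lib_len: int) -> bool:
    """True if lib_len is between 80% and 125% of query_len.

    Assumption: both bounds are inclusive.
    0.80 == 4/5 and 1.25 == 5/4, so compare with integers.
    """
    return 5 * lib_len >= 4 * query_len and 4 * lib_len <= 5 * query_len


def in_glsearch_window(query_len: int, lib_len: int) -> bool:
    """True if lib_len is longer than 80% of query_len (no upper limit)."""
    return 5 * lib_len > 4 * query_len


if __name__ == "__main__":
    query_len = 218
    for lib_len in (150, 175, 218, 272, 273, 1000):
        print(lib_len,
              in_ggsearch_window(query_len, lib_len),
              in_glsearch_window(query_len, lib_len))
```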
March 29, 2007 CVS fa34t27br_lal_1
At last, the lalign (SIM) algorithm has been moved from FASTA21 to
FASTA35. A plalign equivalent is also available using
lalign -m 11 | lav2ps
or
| lav2svg.
The statistical estimates for lalign35 should be much more accurate
than those from the earlier lalign, because lambda and K are estimated
from shuffles.
In addition, all programs can now generate accurate statistical
estimates with shuffles if the library has fewer than 500 sequences.
If the library contains more than 500 sequences and the sequences are
related, then the -z 11 option should be used.
FASTA v34 Change Log