ChangeLog - FASTA v36
$Id: changes_v36.html 773 2011-05-31 14:31:33Z wrp $
$Revision: 210 $
Summary - Major Changes in FASTA version 36.3.5 (May, 2011)
-
By default, the FASTA36 programs are no longer interactive. Typing
fasta36 presents a short help message, and
fasta36 -help presents a complete list of options. To see the interactive prompts, use
fasta36 -I.
Likewise, the score histogram is no longer shown by default; use
the -H option to show the histogram (or compile with
-DSHOW_HIST for previous behavior).
The _t (fasta36_t) versions of the programs are
built automatically on Linux/MacOSX machines and
named fasta36, etc. (the programs are threaded by default,
and only one program version is built).
Documentation has been significantly revised and updated.
See doc/fasta_guide.pdf for a description of the programs and options.
-
Display of all significant alignments between query and library
sequence. BLAST has always displayed multiple high-scoring
alignments (HSPs) between the query and library sequence; previous
versions of the FASTA programs displayed only the best alignment,
even when other high-scoring alignments were present. This is the
major change in FASTA36. For most programs
(fasta36, ssearch36,
[t]fast[xy]36), if the library sequence contains additional
significant alignments, they will be displayed with the alignment
output, and as part of -m 9 output (the initial list of high
scores).
By default, the statistical threshold for alternate alignments
(HSPs) is the E()-threshold / 10.0. For proteins, the default
expect threshold is E() < 10.0, so the secondary threshold for showing
alternate alignments is E() < 1.0. For translated
comparisons, the E()-thresholds are 5.0/0.5; for DNA:DNA, 2.0/0.2.
Both the primary and secondary E()-thresholds are set with the
-E "prim sec" command line option. If the secondary
value is between zero and 1.0, it is taken as the actual
threshold. If it is > 1.0, it is taken as a divisor for the primary
threshold. If it is negative, alternative alignments are disabled
and only the best alignment is shown.
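
The rule for interpreting the second -E value can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not code from the FASTA sources; the function name and the handling of the exact boundary values (0 and 1.0) are assumptions.

```python
from typing import Optional

# Default (primary, secondary) E()-thresholds quoted in the text above.
DEFAULT_THRESHOLDS = {
    "protein": (10.0, 1.0),
    "translated": (5.0, 0.5),
    "dna": (2.0, 0.2),
}


def secondary_threshold(primary: float, secondary: float) -> Optional[float]:
    """Return the E() cutoff for alternate alignments, or None if disabled."""
    if secondary < 0:
        return None             # negative: only the best alignment is shown
    if secondary <= 1.0:
        return secondary        # taken as the actual threshold (boundaries assumed)
    return primary / secondary  # > 1.0: taken as a divisor of the primary threshold


for kind, (prim, sec) in DEFAULT_THRESHOLDS.items():
    # A secondary value of 10 reproduces the default "primary / 10.0" rule.
    assert abs(secondary_threshold(prim, 10.0) - sec) < 1e-12
```

Under this reading, -E "10 10" and -E "10 1.0" give the same secondary cutoff for a protein search, while -E "10 -1" disables alternate alignments.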
-
New statistical options, -z 21, 22, 26, provide a second E()-value
estimate based on shuffles of the highest scoring sequences.
-
New output options. -m 8 provides the same output format as
tabular BLAST; -m 8C mimics tabular BLAST with comment
lines. -m 9C provides CIGAR encoded alignments.
(fasta-36.3.4) Alignment option -m B provides BLAST-like alignments (no context, coordinates at the beginning and end of the alignment line, Query/Sbjct labels).
-
Improved performance using statistics based thresholds for
gap-joining and band-optimization in the heuristic FASTA local
alignment programs (fasta36, [t]fast[xy]36). By
default (fasta36.3) fasta36, [t]fast[xy]36 can use
a similar strategy to BLAST to set the thresholds for combining
ungapped regions and performing band alignments. This dramatically
reduces the number of band alignments performed, for a speed increase
of 2 - 3X. The original statistical thresholds can be enabled with
the -c O (upper-case letter 'O') command line option.
Protein and translated protein alignment programs can also use ktup=3
for increased speed, though ktup=2 is still the default.
Statistical thresholds can dramatically reduce the number of
"optimized" scores, from which statistical estimates are calculated.
To address this problem, the statistical estimation procedure has
been adjusted to correct for the fraction of scores that were
optimized. This process can dramatically improve statistical accuracy
for some matrices and gap penalties, e.g. BLOSUM62 -11/-1.
With the new joining thresholds, the
-c "E-opt E-join" options have expanded meanings. -c "E-opt E-join"
calculates a threshold designed (but not guaranteed) to do band
optimization and joining for that fraction of sequences. Thus, -c
"0.02 0.1" seeks to do band optimization (E-opt) on 2% of alignments,
and joining on 10% of alignments. -c "40 10" sets the gap
threshold as in earlier versions.
-
A new option (-e expand_script.sh) is available that allows
the set of sequences that are aligned to be larger than the set of
sequences searched. When the -e expand_script.sh option is
used, the expand_script.sh script is run with an input
argument that is a file of accession numbers and E()-values; this
information can be used to produce a fasta-formatted list of
additional sequences, which will then be compared and aligned (if they
are significant), and included in the list of high scoring sequences
and the alignments. The expanded set of sequences does not change the
database size or statistical parameters; it simply expands the set of
high-scoring sequences.
-
The -m F option can be used to produce multiple output formats in different files from the same search. For example, -m "F9c,10 m9c10.output" -m "FBB blastBB.output" produces two output files in addition to the normally formatted output sent to stdout. The m9c10.output file contains -m 9c score descriptions and -m 10 alignments, while blastBB.output contains BLAST-like output (-m BB).
-
Scoring matrices can vary with query sequence length. In large-scale
searches with metagenomics reads, some reads may be too short to
produce statistically significant scores against comprehensive
databases (e.g. a DNA read of 90 nt is translated into 30 aa, which
would require a scoring matrix with at least 1.3 bits/position to
produce a 40 bit score). fasta-36.3.* includes the option to specify
a "variable" scoring matrix by including '?' as the first letter of
the scoring matrix abbreviation, e.g. fasta36_t -q -s '?BP62' would
use BP62 for sequences long enough to produce significant alignment
scores, but would use scoring matrices with more information content
for shorter sequences. The FASTA programs include BLOSUM50 (0.49
bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44
bits/position). The variable scoring matrix option searches down the
list of scoring matrices to find one with information content high
enough to produce a 40 bit alignment score. (Several bugs in the
process are fixed in fasta-36.3.2.)
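
The arithmetic behind the variable-matrix option can be sketched as follows. This is a simplified illustration, not the program's actual selection code: it uses only the three matrices and bits/position values quoted above (the real list is longer), assumes the whole query aligns, and the function names are hypothetical.

```python
TARGET_BITS = 40.0  # alignment score the text treats as the goal

# Only the matrices quoted above, ordered by increasing information content.
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]


def required_bits_per_position(aligned_length: int) -> float:
    """Bits/position needed to reach TARGET_BITS over aligned_length residues."""
    return TARGET_BITS / aligned_length


def pick_matrix(query_aa_length: int, start: str = "BLOSUM62"):
    """Search down the list from `start` for a matrix with enough information."""
    names = [name for name, _ in MATRICES]
    needed = required_bits_per_position(query_aa_length)
    for name, bits in MATRICES[names.index(start):]:
        if bits >= needed:
            return name
    return None  # no listed matrix can reach the target score


print(round(required_bits_per_position(90 // 3), 2))  # 90 nt read -> 30 aa
print(pick_matrix(30))    # short read: needs a matrix with more information
print(pick_matrix(200))   # long query: the starting matrix is sufficient
```

For the 90 nt read in the example, 40 bits over 30 aa works out to about 1.33 bits/position, which is where the "at least 1.3 bits/position" figure comes from.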
-
Several less-used options
(-1, -B, -o, -x, -y) have
become extended options, available via the -X (upper case X) option.
The old -X off1,off2 option is now -o off1,off2.
By default, the program will read up to 2 GB (32-bit systems) or 12 GB
(64-bit systems) of the database into memory for multi-query searches.
The amount of memory available for databases can be set with
the -XM4G option.
-
Much greater flexibility in specifying combinations of library files
and subsets of libraries. It has always been possible to search a
list of libraries specified by an indirect (@) file; the FASTA36
programs can include indirect files of library names inside of
indirect files of library names.
-
fasta-36.3.2 ggsearch36 (global/global)
and glsearch36 now incorporate SSE2 accelerated global
alignment, developed by Michael Farrar. These programs are now about
20-fold faster.
-
fasta-36.2.1 (and later versions) are fully threaded, both for
searches, and for alignments. The programs routinely run 12 - 15X
faster on dual quad-core machines with "hyperthreading".
Summary - Major Changes in FASTA version 35 (August, 2007)
- Accurate shuffle based statistics for searches of small libraries (or pairwise comparisons).
-
Inclusion of lalign35 (SIM) into FASTA3. Accurate statistics for
lalign35 alignments. plalign has been replaced by
lalign35 and lav2ps.
-
Two new global alignment programs: ggsearch35 and glsearch35.
February 7, 2008
Allow annotations in library, as well as
query sequences. Currently, annotations are only available within
sequences (i.e., they are not read from the feature table), but they
should be available in FASTA format, or any of the other ascii text
formats (EMBL/Swissprot, Genbank, PIR/GCG). If annotations are
present in a library and the annotation characters include '*', then
the -V '*' option MUST be used. However, special characters other
than '*' are ignored, so annotations of '@', '%', or '@' should be
transparent.
In translated sequence comparisons, annotations are only available for
the protein sequence.
January 25, 2007
Support protein queries and sequence
libraries that contain 'O' (pyrrolysine) and 'U' (selenocysteine).
('J' was supported already). Currently, 'O' is mapped automatically to
'K' and 'U' to 'C'.
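
As an illustration only (this is not the FASTA implementation, and the function name is hypothetical), the residue mapping described above amounts to a simple character translation:

```python
# Map the rare residues to the standard residues named in the text:
# 'O' (pyrrolysine) -> 'K' and 'U' (selenocysteine) -> 'C'.
# Handling of lower-case letters is not described, so only upper case is mapped.
RARE_RESIDUE_MAP = str.maketrans({"O": "K", "U": "C"})


def map_rare_residues(sequence: str) -> str:
    """Return the sequence with 'O' and 'U' replaced as described above."""
    return sequence.translate(RARE_RESIDUE_MAP)


print(map_rare_residues("MOUSE"))
```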
Dec. 13, 2007 CVS fa35_03_02m
Add ability to search a subset of a library using a file name and a
list of accession/gi numbers. This version introduces a new filetype,
10, which consists of a first line with a target filename, format, and
accession number format-type, and optionally the accession number
format in the database, followed by a list of accession numbers. For
example:
</slib2/blast/swissprot.lseg 0:2 4|
3121763
51701705
7404340
74735515
...
Tells the program that the target database is swissprot.lseg, which is
in FASTA (library type 0) format.
The accession format comes after the ":". Currently, there are four
accession formats, two that require ordered accessions (:1, :2), and
two that hash the accessions (:3, :4) so they do not need to be
ordered. The number and character after the accession format
(e.g. "4|") indicate the offset of the beginning of the accession and
the character that terminates the accession. Thus, in the typical
NCBI Fasta definition line:
>gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
The offset is 4 and the termination character is '|'. For databases
distributed in FASTA format from the European Bioinformatics
Institute, the offset depends on the name of the database, e.g.
>SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
and the delimiter is ' ' (space, the default).
Accession formats 1 and 3 expect strings; accession formats 2 and 4
work with integers (e.g. gi numbers).
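
A short Python sketch of how the header line and the offset/terminator pair could be interpreted. It follows the description above but is not taken from the FASTA sources; the function names, the returned tuple layout, and the default offset of 0 when the third field is missing are assumptions.

```python
def parse_subset_header(line: str):
    """Split a type-10 header such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    fields = line.lstrip("<").split()
    filename = fields[0]
    lib_type, acc_format = (int(x) for x in fields[1].split(":"))
    offset, terminator = 0, " "   # space is the default delimiter; offset 0 is assumed
    if len(fields) > 2:
        spec = fields[2]
        i = 0
        while i < len(spec) and spec[i].isdigit():
            i += 1                # leading digits are the offset
        offset = int(spec[:i]) if i else 0
        terminator = spec[i:] or " "
    return filename, lib_type, acc_format, offset, terminator


def extract_accession(defline: str, offset: int, terminator: str) -> str:
    """Return the text from `offset` up to (not including) `terminator`."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end == -1 else defline[offset:end]


header = parse_subset_header("</slib2/blast/swissprot.lseg 0:2 4|")
ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)"
ebi = ">SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104)."
print(header)
print(extract_accession(ncbi, header[3], header[4]))
print(extract_accession(ebi, 4, " "))
```

Note that the offset counts from the start of the definition line, including the leading '>'.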
December 10, 2007
Provide encoded annotation information with
-m 9c alignment summaries. The encoded alignment information makes it
much simpler to highlight changes in critical residues.
August 22, 2007
A new program is available, lav2svg, which creates SVG (Scalable Vector
Graphics) output. In addition, ps_lav, which was introduced May 30, 2007,
has been replaced by lav2ps. SVG files are more easily edited with Adobe
Illustrator than PostScript (lav2ps) files.
July 25, 2007 CVS fa35_02_02
Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
July 23, 2007
Add code to support sub-sequence ranges for "library"
sequences - necessary for fully functional prss (ssearch35) and
lalign35. For all programs, it is now possible to specify a subset of
both the query and the library, e.g.
lalign35 -q mchu.aa:1-74 mchu.aa:75-148
Note, however, that the subset range applied to the library will be
applied to every sequence in the library - not just the first - and
that the same subset range is applied to each sequence. This probably
makes sense only if the library contains a single sequence (this is
also true for the query sequence file).
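
The file:start-end notation can be illustrated with a small Python sketch. It is not the FASTA parser; the function names are hypothetical, and the coordinates are assumed to be 1-based and inclusive (consistent with 1-74 and 75-148 splitting a 148-residue sequence in the example above).

```python
def split_range(spec: str):
    """Split 'mchu.aa:1-74' into ('mchu.aa', (1, 74)); no range gives None."""
    name, sep, rng = spec.rpartition(":")
    if not sep or "-" not in rng:
        return spec, None
    start, end = rng.split("-", 1)
    if not (start.isdigit() and end.isdigit()):
        return spec, None
    return name, (int(start), int(end))


def apply_range(sequence: str, rng):
    """Return the 1-based, inclusive sub-sequence; None means the whole sequence."""
    if rng is None:
        return sequence
    start, end = rng
    return sequence[start - 1:end]


name, rng = split_range("mchu.aa:75-148")
print(name, rng)
# The same range is applied to every sequence in the file, not just the first.
print([apply_range(seq, (3, 5)) for seq in ["ABCDEFGH", "MNPQRSTV"]])
```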
July 3, 2007 CVS fa35_02_01
Merge of previous fasta34 with development version fasta35.
June 26, 2007
Add amino-acid 'J' for 'I' or 'L'.
Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
June 7, 2007
ggsearch35(_t) and glsearch35(_t) can now use PSSMs.
May 30, 2007 CVS fa35_01_04
Addition of ps_lav (now lav2ps or lav2svg) -- which can be used to plot
the lav output of lalign35 -m 11.
lalign35 -m 11 | lav2ps
replaces plalign (from FASTA2).
May 2, 2007
The labels on the alignment scores are much more informative (and more
diverse). In the past, alignment scores looked like:
>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^
where the highlighted text was either: "Smith-Waterman" or "banded
Smith-Waterman". In fact, scores were calculated in other ways,
including global/local for fasts and fastf. With the addition of
ggsearch35, glsearch35, and lalign35, there are many more ways to
calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)". The last option is a global/global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.
April 19, 2007 CVS fa34t27br_lal_3
Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35(_t) calculates an alignment
that is global in the query and local in the library
sequence. The latter program is designed for global alignments to domains.
Both programs assume that scores are normally distributed. This
appears to be an excellent approximation for ggsearch35 scores, but
the distribution is somewhat skewed for global/local (glsearch)
scores.
ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
limits comparisons to library sequences that are longer than 80% of
the query. Initial results suggest that there is relatively little
length dependence of scores over this range (scores go down
dramatically outside these ranges).
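
The length limits can be written down as two small predicates. This Python sketch only restates the rule above; the function names are hypothetical, and whether the real programs treat the 80% and 125% boundaries as inclusive is an assumption.

```python
def ggsearch_length_ok(query_len: int, library_len: int) -> bool:
    """Global/global: library length between 80% and 125% of the query length."""
    return 0.8 * query_len <= library_len <= 1.25 * query_len


def glsearch_length_ok(query_len: int, library_len: int) -> bool:
    """Global/local: library sequence longer than 80% of the query length."""
    return library_len > 0.8 * query_len


query_len = 200
for library_len in (100, 180, 250, 1000):
    print(library_len,
          ggsearch_length_ok(query_len, library_len),
          glsearch_length_ok(query_len, library_len))
```

For a 200-residue query, this keeps library sequences of roughly 160 to 250 residues for ggsearch35 and anything longer than 160 residues for glsearch35.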
March 29, 2007 CVS fa34t27br_lal_1
At last, the lalign (SIM) algorithm has been moved from FASTA21 to
FASTA35. A plalign equivalent is also available using
lalign -m 11 | lav2ps
or
| lav2svg.
The statistical estimates for lalign35 should be much more accurate
than those from the earlier lalign, because lambda and K are estimated
from shuffles.
In addition, all programs can now generate accurate statistical
estimates with shuffles if the library has fewer than 500 sequences.
If the library contains more than 500 sequences and the sequences are
related, then the -z 11 option should be used.
FASTA v34 Change Log