ChangeLog - FASTA v34
$Id: changes_v34.html 120 2010-01-31 19:42:09Z wrp $
$Revision: 210 $
May 28, 2007
Small modification for GCG ASCII (libtype=5) header line.
October 6, 2006 CVS fa34t26b3
New Windows programs available using Intel C++ compiler. First
threaded programs for Windows; first SSE2 acceleration of SSEARCH for
Windows.
July 18, 2006 CVS fa34t26b2
More powerful environment variable substitutions for FASTLIBS files.
The library file name parsing programs now provide the option for
environment variable substitions. For example, SLIB2=/slib2 as an
environment variable (e.g. export SLIB2=/slib2 for ksh and bash), then
fasta34 -q query.aa '${SLIB2}/swissprot.fa' expands as expected.
While this is not important for command lines, where the Unix shell
would expand things anyway, it is very helpful for various
configuration files, such as files of file names, where:
<${SLIB2}/blast
swissprot.fa
now expands properly, and in FASTLIBS files the line:
NCBI/Blast Swissprot$0S${SLIB2}/blast/swissprot.fa
expands properly. Currently, Environment variable expansion only
takes place for library file names, and the <directory in a file of
file names.
July 2, 2006 fa34t26b0
This release provides an extremely efficient SSE2 implementation of
the Smith-Waterman algorithm for the SSE2 vector instructions written
by Michael Farrar (farrar.michael@gmail.com). The SSE code speeds up
Smith-Waterman 8 - 10-fold in my tests, making it comparable to Eric
Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.
May 24, 2006 fa34t25d8
In addition, support for ASN.1 PSSM:2 files provided by the NCBI
PSI-BLAST WWW site is included. This code will not work with
iteration 0 PSSM's (which have no PSSM information). For ASN.1
PSSM's, which provide the matrix name (and in some cases the gap
penalties), the scoring matrix and gap penalties are set appropriately
if they were not specified on the command line. ASN.1 PSSM's are type 2:
ssearch34 -P "pssm.asn1 2" .....
May 18, 2006
Support for NCBI Blast formatdb databases has been expanded. The
FASTA programs can now read some NCBI *.pal and *.nal files, which are
used to specify subsets of databases. Specifically, the
swissprot.00.pal and pdbaa.00.pal files are supported. FASTA supports
files that refer to *.msk files (i.e. swissprot.00.pal refers to
swissprot.00.msk); it does not currently support .pal files that
simply list other .pal or database files (e.g. FASTA does not support
nr.pal or swissprot.pal).
Nov 20, 2005
Changes to support asymmetric matrices - a scoring matrix read in from
a file can be asymmetric. Default matrices are all symmetric.
Sept 2, 2005
The prss34 program has been modified to use the same display routines
as the other search programs. To be more consistent with the other
programs, the old "-w shuffle-window-size" is now "-v window-size".
prss34/prfx34 will also show the optimal alignment for which the
significance is calculated by using the "-A" option.
Since the new program reports results exactly like other
fasta/ssearch/fastxy34 programs, parsing for statistical significance
is considerably different. The old format program can be make using
"make prss34o".
May 5, 2005 CVS fa34t25d1
Modification to the -x option, so that both an "X:X" match score and
an "X:not-X" mismatch score can be specified. (This score is also used
to give a positive score to a "*:*" match - the end of a reading frame,
while giving a negative score to "*:not-*".
Jan 24, 2005
Include a new program, "print_pssm", which reads a blastpgp binary
checkpoint file and writes out the frequency values as text. These
values can be used with a new option with ssearch34(_t) and prss34,
which provides the ability to read a text PSSM file. To specify a
text PSSM, use the option -P "query.ckpt 1" where the "1" indicates a
text, rather than a binary checkpoint file. "initfa.c" has also been
modified to work with PSSM files with zero's in the in the frequency
table. Presumably these positions (at the ends) do not provide
information. (Jan 26, 2005) blastpgp actually uses BLOSUM62 values
when zero frequencies are provided, so read_pssm() has been modified
to use scoring matrix values for zero frequencies as well.
Nov 4-8, 2004
Incorporation of Erik Lindahl "anti-diagonal" Altivec code for
Smith-Waterman, only. Altivec SSEARCH is now faster than FASTA for
Aug 25,26, 2004 CVS fa34t24b3
Small change in output format for
p34comp* programs in
">>>query_file#1 string" line before alignments. This line is not present
in the non-parallel versions - it would be better for them to be consistent.
Dec 10, 2003 CVS fa34t23b3
Cause default ktup to drop for short query sequences. For protein queries < 50,
ktup=1;
for DNA queries < 20, 50, 100
ktup = 1, 2, 3, respectively.
Dec 7, 2003
A new option, "-U" is available for RNA sequence comparison. "-U"
functions like "-n", indicating that the query is an RNA sequence. In
addition, to account for "G:U" base pairs, "-U" modifies the scoring
matrices so that a "G:A" match has the same score as a "G:G" match,
and "T:C" match has the same score as a "T:T" match.
Nov 2, 2003
Support for more sophisticated display options. Previously, one could
have only on "-m #" option, even though several of the options were
orthogonal (-m 9c is independent of -m 1 and -m2, which is independent
of -m 6 (HTML)). In particular -m 9c can be combined with -m 6, which
can be very helpful for runs that need HTML output but can also
exploit the encoding provided by -m 9c.
The "-m 9" option now also allows "-m 9i", which shows the standard
best score information, plus percent identity and alignment length.
Sept 25, 2003
A new option is available for annotating alignments. -V '@#?!'
can be used to annotate sites in a sequence, e.g:
>GTM1_HUMAN ...
PMILGYWDIRGLAHAIRLLLEYTDS@S?YEEKKYT@MG
DAPDYDRS@QWLNEKFKLGLDFPNLPYLIDGAHKIT
might mark known and expected (S,T) phosphorylation sites. These
symbols are then displayed on the query coordinate line:
10 20 @? 30 @ 40 @ 50 60
GTM1_H PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gtm1_h PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
10 20 30 40 50 60
This annotation is mostly designed to display post-translational
modifications detected by MassSpec with FASTS, but is also available
with FASTA and SSEARCH.
June 16, 2003 version: fasta34t22
ssearch34 now supports PSI-BLAST PSSM/profiles. Currently, it only
supports the "checkpoint" file produced by blastall, and only on
certain architectures where byte-reordering is unnecessary. It has not
been tested extensively with the -S option.
ssearch34 -P blast.ckpt -f -11 -g -1 -s BL62 query.aa library
Will use the frequency information in the blast.chkpt file to do a
position specific scoring matrix (PSSM) search using the
Smith-Waterman algorithm. Because ssearch34 calculates scores for
each of the sequences in the database, we anticipate that PSSM
ssearch34 statistics will be more reliable than PSI-Blast statistics.
The Blast checkpoint file is mostly double precision frequency
numbers, which are represented in a machine specific way. Thus, you
must generate the checkpoint file on the same machine that you run
ssearch34 or prss34 -P query.ckpt. To generate a checkpoint file,
run:
blastpgp -j 2 -h 1e-6 -i query.fa -d swissprot -C query.ckpt -o /dev/null
(This searches swissprot for 2 iterations ("-j 2" using a E()
threshold 1e-6 saving the resulting position specific frequencies in
query.ckpt. Note that the original query.fa and query.ckpt must
match.)
Apr 11, 2003 CVS fa34t21b3
Fixes for "-E" and "-F" with ssearch34, which was inadvertantly disabled.
A new option, "-t t", is available to specify that all the protein
sequences have implicit termination codons "*" at the end. Thus, all
protein sequences are one residue longer, and full length matches are
extended one extra residue and get a higher score. For
fastx34/tfastx34, this helps extend alignments to the very end in
cases where there may be a mismatch at the C-terminal residues.
-m 9c has also been modified to indicate locations of termination
codons ( *1).
Mar 17, 2003 CVS fa34t21b2
A new option on scoring matrices "-MS" (e.g. "BL50-MS") can be used to
turn the I/L, K/Q identities on or off. Thus, to make "fastm34" use
the isobaric identities, use "-s M20-MS". To turn them off for "fasts34",
use "-s M20".
Jan 25, 2003
Add option "-J start:stop" to pv34comp*/mp34comp*. "-J x" used to
allow one to start at query sequence "x"; now both start and stop can
be specified.
Nov 14-22, 2002 CVS fa34t20b6
Include compile-time define (-DPGM_DOC) that causes all the fasta
programs to provide the same command line echo that is provided by the
PVM and MPI parallel programs. Thus, if you run the program:
fasta34_t -q -S gtt1_drome.aa /slib/swissprot 12
the first lines of output from FASTA will be:
# fasta34_t -q gtt1_drome.aa /slib/swissprot
FASTA searches a protein or DNA sequence data bank
version 3.4t20 Nov 10, 2002
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
This has been turned on by default in most FASTA Makefiles.
Aug 27, 2002
Modifications to mshowbest.c and drop*.c (and p2_workcomp.c,
compacc.c, doinit.c, etc.) to provide more information about the
alignment with the -m 9 option. There is now a "-m 9c" option, which
displays an encoded alignment after the -m 9 alignment information.
The encoding is a string of the form: "=#mat+#ins=#mat-#del=#mat".
Thus, an alignment over 218 amino acids with no gaps (not necessarily
100% identical) would be =218. The alignment:
10 20 30 40 50 60 70
GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
:.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
20 30 40 50 60
would be encoded: "=23+9=13-2=10-1=3+1=5". The alignment encoding is
with respect to the beginning of the alignment, not the beginning of
either sequence. The beginning of the alignment in either sequence is
given by the an0/an1 values. This capability is particularly useful
for [t]fast[xy], where it can be used to indicate frameshift positions
"/#\#" compactly. If "-m 9c" is used, the "The best scores" title
line includes "aln_code".
Aug 14, 2002 CVS tag fa34t20
Changes to nmgetlib.c to allow multiple query searches coming from
STDIN, either through pipes or input redirection. Thus, the command
cat prot_test.lseg | fasta34 -q -S @ /seqlib/swissprot
produces 11 searches. If you use the multiple query functions, the
query subset applies only to the first sequence.
Unfortunately, it is not possible to search against a STDIN library,
because the FASTA programs do not keep the entire library in memory
and need to be able to re-read high-scoring library sequences. Since
it is not possible to fseek() against STDIN, searching against a STDIN
library is not possible.
Aug 5, 2002
fasts34(_t) and
fastm34(_t) have been modified to allow searches with
DNA sequences. This gives a new capability to search for DNA motifs,
or to search for ordered or unordered DNA sequences spaced at
arbitrary distances.
June 25, 2002
Modify the statistical estimation strategy to sample all the sequences
in the database, not just the first 60,000. The histogram is still
based only on the first 60,000 scores and lengths, though all scores
an lengths are shown. The fit to the data may be better than the
histogram indicates, but it should not be worse.
June 19, 2002
Added "-C #" option, where 6 <= # <= MAX_UID (20), to specify the
length of the sequence name display on the alignment labels. Until
now, only 6 characters were ever displayed. Now, up to MAX_UID
characters are available.
Mar 16, 2002
Added create_seq_demo.sql, nt_to_sql.pl to show how to build an SQL
protein sequence database that can be used with with the mySQL
versions of the fasta34 programs. Once the mySQL seq_demo database
has been installed, it can be searched using the command:
fasta34 -q mgstm1.aa "seq_demo.sql 16"
mysql_lib.c has been modified to remove the restriction that mySQL
protein sequence unique identifiers be integers. This allows the
program to be used with the PIRPSD database. The RANLIB() function
call has been changed to include "libstr", to support SQL text keys.
Due to the size of libstr[], unique ID's must be < MAX_UID (20)
characters.
A "pirpsd.sql" file is available for searching the mySQL distribution
of the PIRPSD database. PIRPSD is available from
ftp://nbrfa.georgetown.edu/pir_databases/psd/mysql.