ChangeLog - FASTA v36




Latest Updates - FASTA version 36.3.8h (March, 2019)

  1. The FASTA programs have been released under the Apache 2.0 Open Source License. The COPYRIGHT file, and copyright notices in program files, have been updated to reflect this change.

  2. fasta-36.3.8h includes bug fixes for translated alignments with termination codons, the ability to use scripts as query and library sequences, and new scripts for extracting genomic DNA sequences given chromosome coordinates.
  3. fasta-36.3.8g includes bug fixes for sub-alignment scoring and psisearch2 scripts, new annotation scripts for exons, and fixes enabling very low statistical thresholds with ggsearch36 and glsearch36.
  4. fasta-36.3.8e/scripts includes updated scripts for capturing domain and feature annotations using the EBI/proteins API (https://www.ebi.ac.uk/proteins/api/) to get Uniprot annotations and exon locations.

  5. The fasta-36.3.8e/psisearch2/ directory now provides psisearch2_msa.pl and psisearch2_msa.py, functionally identical scripts for iterative searching with psiblast or ssearch36. psisearch2_msa.pl offers an option, --query_seed, that can dramatically reduce false positives caused by alignment overextension, with very little loss of search sensitivity.

  6. The fasta-36.3.8d/scripts/ directory now provides a script, annot_blast_btop2.pl, that allows annotations and sub-alignment scoring on BLAST alignments that use the tabular format with BTOP alignment encoding.

  7. Alignment sub-scoring scripts have been extended to allow overlapping domains. This requires a modified annotation file format. The "classic" format placed the beginning and end of a domain on different lines:
        1   [    -     GST_N
       88   ]    -
       90   [    -     GST_C
      208   ]    - 
    
    Since the closing "]" was associated with the previous "[", domains could not overlap.

    The new format is:

        1   -    88     GST_N
       90   -   208    GST_C
    
    which allows annotations of the form:
        1   -    88    GST_N
       75   -   123    GST-middle
       90   -   208    GST_C
    

  8. New annotation scripts are available in the fasta-36.3.8/scripts directory, e.g. ann_pfam_www_e.pl (Pfam) and ann_up_www2_e.pl (Uniprot), to support this new format. If the domain annotations provided by Pfam or Uniprot overlap, then overlapping domains are provided. The new _e.pl scripts can be directed to provide non-overlapping domains, using the boundary averaging strategy in the older scripts, by specifying the --no-over option.
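
The difference between the two annotation formats in item 7 can be illustrated with a short Python sketch. This is not code from the FASTA distribution: the function names (classic_to_ranges, parse_ranges, overlapping_pairs) are hypothetical, and the sketch assumes whitespace-separated fields laid out exactly as in the examples above.

```python
from typing import List, Tuple

Domain = Tuple[int, int, str]  # (start, end, name)


def classic_to_ranges(lines: List[str]) -> List[Domain]:
    """Pair each classic '[' line with the next ']' line (hypothetical helper)."""
    domains: List[Domain] = []
    open_start, open_name = None, ""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        pos, symbol = int(fields[0]), fields[1]
        if symbol == "[":
            open_start = pos
            open_name = fields[3] if len(fields) > 3 else ""
        elif symbol == "]" and open_start is not None:
            # A ']' always closes the most recent '[', so domains cannot overlap.
            domains.append((open_start, pos, open_name))
            open_start, open_name = None, ""
    return domains


def parse_ranges(lines: List[str]) -> List[Domain]:
    """Parse new-format lines of the form 'start - end name'."""
    domains: List[Domain] = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "-":
            domains.append((int(fields[0]), int(fields[2]), fields[3]))
    return domains


def overlapping_pairs(domains: List[Domain]) -> List[Tuple[str, str]]:
    """Return the names of every pair of domains that share at least one residue."""
    ordered = sorted(domains)
    pairs = []
    for i, (_, end1, name1) in enumerate(ordered):
        for start2, _, name2 in ordered[i + 1:]:
            if start2 <= end1:
                pairs.append((name1, name2))
    return pairs


classic = ["1   [    -     GST_N", "88   ]    -", "90   [    -     GST_C", "208   ]    -"]
new = ["1   -    88    GST_N", "75   -   123    GST-middle", "90   -   208    GST_C"]
print(classic_to_ranges(classic))
print(overlapping_pairs(parse_ranges(new)))
```

Because each line of the new format carries both coordinates, a domain such as GST-middle can be listed without disturbing its neighbours, which the bracket-pairing logic of the classic format cannot express.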

Updates - FASTA version 36.3.6f (August, 2014)

FASTA version 36.3.6f extends previous versions in several ways:

  1. There is a new command line option, -XI, that causes the alignment programs to report 100% identity only when there are no mismatches. In previous versions, one mismatch in 10,000 would round up to 100.0% identity; with -XI, the identity will be reported as 99.9%.
  2. The option to provide alignment encodings (-m 9c, or -m 9C for CIGAR strings) has been extended to provide mismatch information in the alignment encoding using the -m 9d (classic FASTA alignment encoding) or -m 9D (CIGAR string) options. For protein alignments, which are often < 40% identical, enabling mismatch encoding produces very long CIGAR strings.
  3. More scripts are provided for annotating proteins using either UniProt or Pfam web resources.

Additional bug fixes are documented in fasta-36.3.6f/doc/readme.v36
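
The rounding issue that -XI addresses (item 1 above) can be reproduced in a few lines of Python. This is only a sketch of the reported behaviour, not the FASTA implementation: the function name format_identity and its strict parameter are hypothetical, and capping at 99.9 is an assumption about how the value is adjusted.

```python
def format_identity(n_identical: int, n_aligned: int, strict: bool = False) -> str:
    """Format percent identity to one decimal place.

    With strict=True (mimicking -XI), 100.0% is reported only when there
    are no mismatches at all; otherwise the value is capped at 99.9%.
    """
    pct = 100.0 * n_identical / n_aligned
    text = f"{pct:.1f}"
    if strict and n_identical < n_aligned and text == "100.0":
        text = "99.9"  # assumption: cap rather than re-round
    return text + "%"


# One mismatch in 10,000 aligned positions:
print(format_identity(9999, 10000))               # rounds up to 100.0%
print(format_identity(9999, 10000, strict=True))  # 99.9%
```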

Updates - FASTA version 36.3.6 (July, 2013)

FASTA version 36.3.6 provides two new features:

  1. A new script-based strategy for including annotation information.
  2. Domain annotation information can be used to partition the alignment, and to partition the scores of the alignment (sub-alignment scores). Sub-alignment scores can be used to identify regions of alignment over-extension, where a homologous domain aligns, but the alignment extends beyond the homologous region into an adjacent non-homologous domain.
Several scripts are provided (e.g. scripts/ann_feats_up_www.pl) that can be used to add Uniprot feature and domain annotations to searches of SwissProt and Uniprot.

(fasta-36.3.5 January 2013) The NCBI's transition from BLAST to BLAST+ several years ago broke the ability of ssearch36 to use PSSMs, because psiblast did not produce the binary ASN.1 PSSMs that ssearch36 could parse. With the January 2013 fasta-36.3.5f release, ssearch36 can read binary ASN.1 PSSM files produced by the NCBI datatool utility. See fasta_guide.pdf for more information (look for the -P option).


Summary - Major Changes in FASTA version 36.3.5 (May, 2011)

  1. By default, the FASTA36 programs are no longer interactive. Typing fasta36 presents a short help message, and fasta36 -help presents a complete list of options. To see the interactive prompts, use fasta36 -I.

    Likewise, the score histogram is no longer shown by default; use the -H option to show the histogram (or compile with -DSHOW_HIST for previous behavior).

    The _t (fasta36_t) versions of the programs are built automatically on Linux/MacOSX machines and named fasta36, etc. (the programs are threaded by default, and only one program version is built).

    Documentation has been significantly revised and updated. See doc/fasta_guide.pdf for a description of the programs and options.

  2. Display of all significant alignments between query and library sequence. BLAST has always displayed multiple high-scoring alignments (HSPs) between the query and library sequence; previous versions of the FASTA programs displayed only the best alignment, even when other high-scoring alignments were present. This is the major change in FASTA36. For most programs (fasta36, ssearch36, [t]fast[xy]36), if the library sequence contains additional significant alignments, they will be displayed with the alignment output, and as part of -m 9 output (the initial list of high scores).

    By default, the statistical threshold for alternate alignments (HSPs) is the E()-threshold / 10.0. For proteins, the default expect threshold is E() < 10.0, so the secondary threshold for showing alternate alignments is E() < 1.0. For translated comparisons, the E()-thresholds are 5.0/0.5; for DNA:DNA, 2.0/0.2.

    Both the primary and secondary E()-thresholds are set with the -E "prim sec" command line option. If the secondary value is between zero and 1.0, it is taken as the actual threshold. If it is > 1.0, it is taken as a divisor for the primary threshold. If it is negative, alternative alignments are disabled and only the best alignment is shown.

  3. New statistical options, -z 21, 22, 26, provide a second E()-value estimate based on shuffles of the highest scoring sequences.

  4. New output options. -m 8 provides the same output format as tabular BLAST; -m 8C mimics tabular blast with comment lines. -m 9C provides CIGAR encoded alignments.

    (fasta-36.3.4) Alignment option -m B provides BLAST-like alignments (no context, coordinates at the beginning and end of the alignment line, Query/Sbjct).

  5. Improved performance using statistics based thresholds for gap-joining and band-optimization in the heuristic FASTA local alignment programs (fasta36, [t]fast[xy]36). By default (fasta36.3) fasta36, [t]fast[xy]36 can use a similar strategy to BLAST to set the thresholds for combining ungapped regions and performing band alignments. This dramatically reduces the number of band alignments performed, for a speed increase of 2 - 3X. The original statistical thresholds can be enabled with the -c O (upper-case letter 'O') command line option. Protein and translated protein alignment programs can also use ktup=3 for increased speed, though ktup=2 is still the default.

    Statistical thresholds can dramatically reduce the number of "optimized" scores, from which statistical estimates are calculated. To address this problem, the statistical estimation procedure has been adjusted to correct for the fraction of scores that were optimized. This process can dramatically improve statistical accuracy for some matrices and gap penalties, e.g. BLOSUM62 -11/-1.

    With the new joining thresholds, the -c "E-opt E-join" options have expanded meanings. -c "E-opt E-join" calculates a threshold designed (but not guaranteed) to do band optimization and joining for that fraction of sequences. Thus, -c "0.02 0.1" seeks to do band optimization (E-opt) on 2% of alignments, and joining on 10% of alignments. -c "40 10" sets the gap threshold as in earlier versions.

  6. A new option (-e expand_script.sh) is available that allows the set of sequences that are aligned to be larger than the set of sequences searched. When the -e expand_script.sh option is used, the expand_script.sh script is run with an input argument that is a file of accession numbers and E()-values; this information can be used to produce a fasta-formatted list of additional sequences, which will then be compared and aligned (if they are significant), and included in the list of high scoring sequences and the alignments. The expanded set of sequences does not change the database size or statistical parameters; it simply expands the set of high-scoring sequences.

  7. The -m F option can be used to produce multiple output formats in different files from the same search. For example, -m "F9c,10 m9c10.output" -m "FBB blastBB.output" produces two output files in addition to the normally formatted output sent to stdout. The m9c10.output file contains -m 9c score descriptions and -m 10 alignments, while blastBB.output contains BLAST-like output (-m BB).

  8. Scoring matrices can vary with query sequence length. In large-scale searches with metagenomics reads, some reads may be too short to produce statistically significant scores against comprehensive databases (e.g. a DNA read of 90 nt is translated into 30 aa, which would require a scoring matrix with at least 1.3 bits/position to produce a 40 bit score). fasta-36.3.* includes the option to specify a "variable" scoring matrix by including '?' as the first letter of the scoring matrix abbreviation, e.g. fasta36_t -q -s '?BP62' would use BP62 for sequences long enough to produce significant alignment scores, but would use scoring matrices with more information content for shorter sequences. The FASTA programs include BLOSUM50 (0.49 bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44 bits/position). The variable scoring matrix option searches down the list of scoring matrices to find one with information content high enough to produce a 40 bit alignment score. (Several bugs in the process are fixed in fasta-36.3.2.)

  9. Several less-used options (-1, -B, -o, -x, -y) have become extended options, available via the -X (upper case X) option. The old -X off1,off2 option is now -o off1,off2.

    By default, the program will read up to 2 GB (32-bit systems) or 12 GB (64-bit systems) of the database into memory for multi-query searches. The amount of memory available for databases can be set with the -XM4G option.

  10. Much greater flexibility in specifying combinations of library files and subsets of libraries. It has always been possible to search a list of libraries specified by an indirect (@) file; the FASTA36 programs can include indirect files of library names inside of indirect files of library names.

  11. fasta-36.3.2 ggsearch36 (global/global) and glsearch36 now incorporate SSE2 accelerated global alignment, developed by Michael Farrar. These programs are now about 20-fold faster.

  12. fasta-36.2.1 (and later versions) are fully threaded, both for searches, and for alignments. The programs routinely run 12 - 15X faster on dual quad-core machines with "hyperthreading".
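
The rules in item 2 for interpreting the secondary value of -E "prim sec" can be summarized in a small Python sketch. The function name secondary_threshold is hypothetical and this is not the FASTA source code; it simply restates the three cases described above, with the primary / 10.0 default applied when no secondary value is given. How a value of exactly zero is handled is an assumption.

```python
from typing import Optional


def secondary_threshold(primary: float, secondary: Optional[float] = None) -> Optional[float]:
    """Return the E() threshold for alternate alignments (HSPs).

    Returns None when alternate alignments are disabled.
    """
    if secondary is None:
        return primary / 10.0      # default: E()-threshold / 10.0
    if secondary < 0:
        return None                # negative: only the best alignment is shown
    if secondary > 1.0:
        return primary / secondary  # > 1.0: treated as a divisor
    return secondary               # 0 to 1.0: the actual threshold (0 itself is an assumption)


print(secondary_threshold(10.0))        # protein default -> 1.0
print(secondary_threshold(10.0, 0.5))   # explicit threshold -> 0.5
print(secondary_threshold(10.0, 20.0))  # divisor -> 0.5
print(secondary_threshold(10.0, -1.0))  # disabled -> None
```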

Summary - Major Changes in FASTA version 35 (August, 2007)

  1. Accurate shuffle based statistics for searches of small libraries (or pairwise comparisons).

  2. Inclusion of lalign35 (SIM) into FASTA3. Accurate statistics for lalign35 alignments. plalign has been replaced by lalign35 and lav2ps.
  3. Two new global alignment programs: ggsearch35 and glsearch35.

February 7, 2008

Allow annotations in library, as well as query sequences. Currently, annotations are only available within sequences (i.e., they are not read from the feature table), but they should be available in FASTA format, or any of the other ascii text formats (EMBL/Swissprot, Genbank, PIR/GCG). If annotations are present in a library and the annotation characters include '*', then the -V '*' option MUST be used. However, special characters other than '*' are ignored, so annotations of '@' or '%' should be transparent.

In translated sequence comparisons, annotations are only available for the protein sequence.

January 25, 2007

Support protein queries and sequence libraries that contain 'O' (pyrrolysine) and 'U' (selenocysteine). ('J' was supported already). Currently, 'O' is mapped automatically to 'K' and 'U' to 'C'.
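
As an illustration only, the automatic mapping described above amounts to a simple character substitution. The names RESIDUE_MAP and map_rare_residues are hypothetical, and handling lower-case residues is an assumption.

```python
# 'O' (pyrrolysine) -> 'K' and 'U' (selenocysteine) -> 'C'; 'J' is left unchanged.
RESIDUE_MAP = str.maketrans({"O": "K", "U": "C", "o": "k", "u": "c"})


def map_rare_residues(sequence: str) -> str:
    """Replace pyrrolysine and selenocysteine codes with their stand-ins."""
    return sequence.translate(RESIDUE_MAP)


print(map_rare_residues("MKOUAJ"))
```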

Dec. 13, 2007 CVS fa35_03_02m

Add ability to search a subset of a library using a file name and a list of accession/gi numbers. This version introduces a new filetype, 10, which consists of a first line with a target filename, format, and accession number format-type, and optionally the accession number format in the database, followed by a list of accession numbers. For example:

	  </slib2/blast/swissprot.lseg 0:2 4|
	  3121763
	  51701705
	  7404340
	  74735515
	  ...
This tells the program that the target database is swissprot.lseg, which is in FASTA (library type 0) format.

The accession format comes after the ":". Currently, there are four accession formats, two that require ordered accessions (:1, :2), and two that hash the accessions (:3, :4) so they do not need to be ordered. The number and character after the accession format (e.g. "4|") indicate the offset of the beginning of the accession and the character that terminates the accession. Thus, in the typical NCBI Fasta definition line:

 >gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
The offset is 4 and the termination character is '|'. For databases distributed in FASTA format from the European Bioinformatics Institute, the offset depends on the name of the database, e.g.
 >SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
and the delimiter is ' ' (space, the default).

Accession formats 1 and 3 expect strings; accession formats 2 and 4 work with integers (e.g. gi numbers).
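
To make the header line and the offset/terminator convention concrete, here is a minimal Python sketch. It is not the parser used by the FASTA programs: the names SubsetHeader, parse_subset_header and extract_accession are hypothetical, the sketch assumes all three header fields are present, and it counts the offset from the leading '>' of the definition line (position 0), which matches both examples above.

```python
import re
from typing import NamedTuple


class SubsetHeader(NamedTuple):
    filename: str
    lib_type: int
    acc_format: int
    offset: int
    terminator: str


def parse_subset_header(line: str) -> SubsetHeader:
    """Parse a first line such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    fields = line.strip().split()
    if len(fields) != 3 or not fields[0].startswith("<"):
        raise ValueError(f"unexpected header line: {line!r}")
    lib_type, acc_format = (int(x) for x in fields[1].split(":"))
    match = re.fullmatch(r"(\d+)(.?)", fields[2])
    if match is None:
        raise ValueError(f"unexpected accession spec: {fields[2]!r}")
    # A missing terminator falls back to a space, the default delimiter.
    terminator = match.group(2) or " "
    return SubsetHeader(fields[0][1:], lib_type, acc_format, int(match.group(1)), terminator)


def extract_accession(defline: str, offset: int, terminator: str) -> str:
    """Return the text from 'offset' up to (not including) 'terminator'."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end == -1 else defline[offset:end]


header = parse_subset_header("</slib2/blast/swissprot.lseg 0:2 4|")
ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)"
ebi = ">SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104)."
print(header)
print(extract_accession(ncbi, header.offset, header.terminator))
print(extract_accession(ebi, 4, " "))
```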

December 10, 2007

Provide encoded annotation information with -m 9c alignment summaries. The encoded alignment information makes it much simpler to highlight changes in critical residues.

August 22, 2007

A new program is available, lav2svg, which creates SVG (Scalable Vector Graphics) output. In addition, ps_lav, which was introduced May 30, 2007, has been replaced by lav2ps. SVG files are more easily edited with Adobe Illustrator than postscript (lav2ps) files.

July 25, 2007 CVS fa35_02_02

Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.

July 23, 2007

Add code to support sub-sequence ranges for "library" sequences - necessary for fully functional prss (ssearch35) and lalign35. For all programs, it is now possible to specify a subset of both the query and the library, e.g.
lalign35 -q mchu.aa:1-74 mchu.aa:75-148
Note, however, that the subset range applied to the library will be applied to every sequence in the library - not just the first - and that the same subset range is applied to each sequence. This probably makes sense only if the library contains a single sequence (this is also true for the query sequence file).
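
The caveat above, that one range is applied to every sequence in the file, can be seen in a short Python sketch. The helper names split_range and apply_range are hypothetical, and 1-based inclusive coordinates are an assumption (consistent with 1-74 and 75-148 splitting a 148-residue sequence into two halves).

```python
import re
from typing import Optional, Tuple

Range = Tuple[int, int]


def split_range(spec: str) -> Tuple[str, Optional[Range]]:
    """Split 'file:start-end' into (file, (start, end)); no range gives None."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", spec)
    if match is None:
        return spec, None
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def apply_range(sequence: str, rng: Optional[Range]) -> str:
    """Return the 1-based, inclusive sub-sequence, or the whole sequence."""
    if rng is None:
        return sequence
    start, end = rng
    return sequence[start - 1:end]


name, rng = split_range("mchu.aa:1-74")
print(name, rng)

# The same range is applied to every sequence read from the file.
library = ["ABCDEFGHIJ", "KLMNOPQRST"]
print([apply_range(seq, (3, 5)) for seq in library])
```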

July 3, 2007 CVS fa35_02_01

Merge of previous fasta34 with development version fasta35.

June 26, 2007

Add amino-acid 'J' for 'I' or 'L'.

Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix, "-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).

June 7, 2007

ggsearch35(_t), glsearch35(_t) can now use PSSMs.

May 30, 2007 CVS fa35_01_04

Addition of ps_lav (now lav2ps or lav2svg) -- which can be used to plot the lav output of lalign35 -m 11.
lalign35 -m 11 | lav2ps
replaces plalign (from FASTA2).

May 2, 2007

The labels on the alignment scores are much more informative (and more diverse). In the past, alignment scores looked like:
>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer  (218 aa)
 s-w opt: 1497  Z-score: 1857.5  bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^
where the highlighted text was either: "Smith-Waterman" or "banded Smith-Waterman". In fact, scores were calculated in other ways, including global/local for fasts and fastf. With the addition of ggsearch35, glsearch35, and lalign35, there are many more ways to calculate alignments: "Smith-Waterman" (ssearch and protein fasta), "banded Smith-Waterman" (DNA fasta), "Waterman-Eggert", "trans. Smith-Waterman", "global/local", "trans. global/local", "global/global (N-W)". The last option is a global/global alignment, but with the affine gap penalties used in the Smith-Waterman algorithm.

April 19, 2007 CVS fa34t27br_lal_3

Two new programs, ggsearch35(_t) and glsearch35(_t), are now available. ggsearch35(_t) calculates an alignment score that is global in the query and global in the library; glsearch35(_t) calculates an alignment that is global in the query and local in the library sequence. The latter program is designed for global alignments to domains. Both programs assume that scores are normally distributed. This appears to be an excellent approximation for ggsearch35 scores, but the distribution is somewhat skewed for global/local (glsearch) scores. ggsearch35(_t) only compares the query to library sequences that are between 80% and 125% of the length of the query; glsearch limits comparisons to library sequences that are longer than 80% of the query. Initial results suggest that there is relatively little length dependence of scores over this range (scores go down dramatically outside these ranges).
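
The length limits described above can be written as two small predicates. This is a sketch, not the ggsearch35/glsearch35 source: the function names are hypothetical, and treating the 80% and 125% bounds as inclusive for ggsearch and the 80% bound as strict for glsearch is an assumption based on the wording. Integer arithmetic is used to avoid floating-point edge cases (80% = 4/5, 125% = 5/4).

```python
def ggsearch_length_ok(query_len: int, library_len: int) -> bool:
    """True if the library sequence is between 80% and 125% of the query length."""
    return 4 * query_len <= 5 * library_len and 4 * library_len <= 5 * query_len


def glsearch_length_ok(query_len: int, library_len: int) -> bool:
    """True if the library sequence is longer than 80% of the query length."""
    return 5 * library_len > 4 * query_len


# For a 200-residue query, ggsearch would consider lengths 160 to 250.
print([n for n in (159, 160, 250, 251) if ggsearch_length_ok(200, n)])
print(glsearch_length_ok(200, 1000))
```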

March 29, 2007 CVS fa34t27br_lal_1

At last, the lalign (SIM) algorithm has been moved from FASTA21 to FASTA35. A plalign equivalent is also available using lalign -m 11 | lav2ps or | lav2svg. The statistical estimates for lalign35 should be much more accurate than those from the earlier lalign, because lambda and K are estimated from shuffles. In addition, all programs can now generate accurate statistical estimates with shuffles if the library has fewer than 500 sequences. If the library contains more than 500 sequences and the sequences are related, then the -z 11 option should be used.
FASTA v34 Change Log