Misc. Reference Manual Pages
                         FASTA/SSEARCH/[T]FASTX/Y/LALIGN 1(local)


NAME
     fasta35, fasta35_t - scan a protein or DNA sequence  library
     for similar sequences

     fastx35, fastx35_t  - compare a DNA sequence  to  a  protein
     sequence  database, comparing the translated DNA sequence in
     forward and reverse frames.

     tfastx35, tfastx35_t  - compare a protein sequence to a  DNA
     sequence database, calculating similarities with frameshifts
     to the forward and reverse orientations.

     fasty35, fasty35_t  - compare a DNA sequence  to  a  protein
     sequence  database, comparing the translated DNA sequence in
     forward and reverse frames.

     tfasty35, tfasty35_t  - compare a protein sequence to a  DNA
     sequence database, calculating similarities with frameshifts
     to the forward and reverse orientations.

     fasts35, fasts35_t - compare unordered peptides to a protein
     sequence database

     fastm35, fastm35_t - compare ordered peptides (or short  DNA
     sequences) to a protein (DNA) sequence database

     tfasts35, tfasts35_t  -  compare  unordered  peptides  to  a
     translated DNA sequence database

     fastf35, fastf35_t - compare mixed  peptides  to  a  protein
     sequence database

     tfastf35,  tfastf35_t  -  compare  mixed   peptides   to   a
     translated DNA sequence database

     ssearch35, ssearch35_t - compare a protein or  DNA  sequence
     to a sequence database using the Smith-Waterman algorithm.

     ggsearch35, ggsearch35_t - compare a protein or DNA sequence
     to  a sequence database using a global alignment (Needleman-
     Wunsch)

     glsearch35, glsearch35_t - compare a protein or DNA sequence
     to  a  sequence  database with alignments that are global in
     the query and local in the database sequence (global-local).

     lalign35 - produce multiple non-overlapping alignments for
     protein and DNA sequences using the Huang and Miller sim
     implementation of the Waterman-Eggert algorithm.

     prss35, prfx35 - discontinued; all the FASTA  programs  will
     estimate   statistical   significance   using  500  shuffled
     sequence scores if two sequences are compared.


SunOS 5.10                Last change:                          1


DESCRIPTION
     Release 3.5 of the FASTA package provides a modular  set  of
     sequence  comparison  programs  that can run on conventional
     single processor computers or in parallel on  multiprocessor
     computers.   More   than   a   dozen   programs  -  fasta35,
     fastx35/tfastx35,    fasty35/tfasty35,     fasts35/tfasts35,
     fastm35,   fastf35/tfastf35,   ssearch35,   ggsearch35,  and
     glsearch35 - are currently available.

     All of the comparison programs share a set of basic  command
     line  options;  additional options are available for indivi-
     dual comparison functions.

     Threaded versions of the FASTA programs (fasta35_t,
     ssearch35_t, etc.) will run in parallel on modern Linux and
     Unix multi-core or multi-processor computers. Accelerated
     versions of the Smith-Waterman algorithm are available for
     the Intel SSE2 and PowerPC Altivec architectures, which can
     speed up Smith-Waterman calculations 10- to 20-fold.

     In addition to the serial and threaded versions of the FASTA
     programs, PVM and MPI parallel versions are available as
     pv35compfa, mp35compfa, pv35compsw, mp35compsw, etc. For
     more information, see pvcomp.1 and readme.pvm_mpi. The
     PVM/MPI program versions use the same command line options
     as the serial and threaded FASTA program versions.


Running the FASTA programs
     Although  the  FASTA  programs  can  be  run  interactively,
     prompting for a query file and a library, it is usually more
     convenient to run them from the Unix,  MacOSX  terminal,  or
     Windows shell command line.  Thus,

     fasta35_t   -q   -option1   -option2   -option3   query.file
     library.file > fasta.output

     runs the threaded version of the fasta35 program, without
     asking for any input (-q), setting various parameter and
     output options, and comparing the sequences in query.file to
     the sequences in library.file. Optional arguments to the
     FASTA programs must precede the query.file, library.file,
     and optional ktup arguments. The FASTA programs provide an
     option (-O file) for sending output to a file, but generally
     it is better to simply redirect output with the ">" shell
     symbol.


FASTA program options
     The default scoring matrix and gap penalties used by each of
     the   programs  have  been  selected  for  high  sensitivity
     searches with the various algorithms.  The  default  program
     behavior can be modified by providing command line options
     before the query.file and library.file arguments. Command
     line options can also be used in interactive mode.

     Command line arguments come in several classes.

     (1) Commands that specify the comparison type. FASTA, FASTS,
     FASTM, SSEARCH, GGSEARCH, and GLSEARCH can compare either
     protein or DNA sequences, and attempt to recognize the
     comparison type by looking at the residue composition. -n
     and -p specify DNA (nucleotide) or protein comparison,
     respectively. -U specifies RNA comparison.

     (_2) _C_o_m_m_a_n_d_s _t_h_a_t _l_i_m_i_t _t_h_e _s_e_t _o_f _s_e_q_u_e_n_c_e_s  _c_o_m_p_a_r_e_d:  -_1,
     -3, -_M.

     (3) Commands that modify the scoring parameters: -f gap-open
     penalty, -g gap-extend penalty, -h inter-codon frame-shift,
     -j within-codon frame-shift, -s scoring-matrix, -r
     match/mismatch score, -x X:X score.

     (4) Commands that modify the algorithm (mostly FASTA and
     [T]FASTX/Y): -c, -w, -y, -o. The -S option can be used to
     ignore lower-case (low complexity) residues during the
     initial score calculation.

     (_5) _C_o_m_m_a_n_d_s _t_h_a_t _m_o_d_i_f_y  _t_h_e  _o_u_t_p_u_t:  -_A,  -b  number,  -_C
     _w_i_d_t_h,  -d  number,  -_L, -m 0-11, -_w _l_i_n_e-_w_i_d_t_h, -W context-
     width, -_X _o_f_f_s_e_t_1,_o_f_s_e_t_2

     (6) Commands that affect statistical estimates: -Z, -k.

Option summary:
     -1   Sort by "init1" score (obsolete)

     -3   (TFASTX/Y35 only) use only forward frame translations

     -a # "SHOWALL"  option  attempts  to  align  all   of   both
          sequences in FASTA and SSEARCH.

     -A   (FASTA35  DNA  comparison  only)  force  Smith-Waterman
          alignment  for  output.   Smith-Waterman is the default
          for FASTA protein alignment and [T]FASTX/Y, but not for
          DNA comparisons with FASTA.

     -b # number of best scores to show (must be  <   expectation
          cutoff  if -E is given).  By default, this option is no
          longer used; all scores  better  than  the  expectation
          (E()) cutoff are listed.

     -B   show z-scores rather than bit scores (for compatibility
          with much older versions).

     -c # threshold for band optimization (FASTA, [T]FASTX/Y)

     -C # length of name abbreviation in alignments, default = 6.
          Must be less than 20.

     -d # number of best alignments to show ( must be <  expecta-
          tion (-E) cutoff)

     -D   turn on debugging mode. Enables checks on the sequence
          alphabet that cause problems with tfastx35 and tfasty35
          (only available if enabled by a compile-time option).

     -E # expectation value upper limit for score  and  alignment
          display.   Defaults  are 10.0 for FASTA35 and SSEARCH35
          protein searches, 5.0 for translated  DNA/protein  com-
          parisons, and 2.0 for DNA/DNA searches.

     -f # penalty for opening a gap.

     -F # expectation value lower limit for score  and  alignment
          display.   -F 1e-6 prevents library sequences with E()-
          values lower than 1e-6 from  being  displayed.  Use  to
          shift focus to more distant relationships.

     -g # penalty for additional residues in a gap

     -h # ([T]FASTX/Y only) penalty for a frameshift between  two
          codons.

     -j # ([T]FASTY only)  penalty  for  a  frameshift  within  a
          codon.

     -H   turn off histogram  display.  (The  meaning  of  -H  is
          reversed  with the PVM/MPI parallel versions, where the
          histogram display is off by default).

     -i   (FASTA  DNA,  [T]FASTX/Y)  compare  against  only   the
          reverse complement of the library sequence.

     -k # specify number of shuffles for statistical parameter
          estimation (default=500). Shuffles are done whenever
          the database size is smaller than this value; in
          particular, 500 shuffles are done when only two
          sequences are aligned. To disable shuffling, use -z -1
          (no statistical estimates) or -z 3 (Altschul-Gish
          statistics).


     -l str
          specify FASTLIBS file

     -L   report long sequence description in alignments  (up  to
          200 characters).

     -m 0,1,2,3,4,5,6,9,10,11
          alignment display options.  -m 0, 1, 2, 3 display  dif-
          ferent types of alignments.  -m 4 provides an alignment
          "map" on the query. -m 5 combines the alignment map and
          a -m 0 alignment.  -m 6 provides an HTML output.

     -m 9 does not change  the  alignment  output,  but  provides
          alignment  coordinate  and percent identity information
          with the best scores report.  -m 9c adds encoded align-
          ment  information to the -m 9; -m 9i provides only per-
          cent identity and alignment length information with the
          best  scores.   With current versions of the FASTA pro-
          grams, independent -m options can be combined; e.g.  -m
          1 -m 9c -m 6.

     -m 11
          provide lav format output from lalign35.  It  does  not
          currently   affect  other  alignment  algorithms.   The
          lav2ps and lav2svg programs can be used to convert  lav
          format output to postscript/SVG alignment "dot-plots".

     -M #-#
          molecular weight (residue) cutoffs.  -M "101-200" exam-
          ines only sequences that are 101-200 residues long.

     -n   force query to nucleotide sequence

     -N # break long library sequences into blocks of # residues.
          Useful for bacterial genomes, which have only one
          sequence entry. -N 2000 works well for bacterial
          genomes.

     -o   (FASTA) turn fasta band optimization off during initial
          phase.   This  was  the  behavior  of fasta1.x versions
          (obsolete).

     -O file
          send output to file.

     -p   Force query sequence type to protein.

     -P "file type"
          specify a PSI-BLAST PSSM file of type "type". Available
          types are:

          0 - ascii PSSM file, produced by blastpgp -Q file.pssm
          1 - binary (architecture dependent) PSSM file, produced
              by blastpgp -C file.pssm -u 0
          2 - binary ASN.1 (architecture independent) PSSM file,
              produced by blastpgp -C file.pssm -u 2

     -q/-Q
          quiet option; do not prompt for input

     -r "+n/-m"
          (DNA only) values for match/mismatch for DNA
          comparisons. +n is used for the maximum positive value
          and -m is used for the maximum negative value. Values
          between max and min are rescaled, but residue pairs
          having the value -1 continue to be -1.

     -R file
          save all scores to statistics file (previously -r file)

     -s name
          specify substitution matrix. BLOSUM50 is used by
          default; PAM250, PAM120, and BLOSUM62 can be specified
          by setting -s P250, -s P120, or -s BL62. With this
          version, many more scoring matrices are available,
          including BLOSUM80 (BL80), and MDM10, MDM20, MDM40
          (Jones, Taylor, and Thornton, 1992 CABIOS 8:275-282;
          specified as -s M10, -s M20, -s M40). Alternatively,
          BLASTP1.4 format scoring matrix files can be specified.
          BL80, BL62, and P120 are scaled in 1/2 bit units; all
          the other matrices use 1/3 bit units. DNA scoring
          matrices can also be specified with the "-r" option.

     -S   treat lower case letters in the query  or  database  as
          low  complexity regions that are equivalent to 'X' dur-
          ing the initial database scan, but are treated as  nor-
          mal residues for the final alignment display.  Statist-
          ical estimates are based on the 'X'ed out sequence used
          during the initial search. Protein databases (and query
          sequences) can be generated in the  appropriate  format
          using  John  Wootton's "pseg"  program,  available from
          ftp://ncbi.nlm.nih.gov/pub/seg/pseg.   Once  you   have
          compiled the "pseg" program, use the command:

          pseg database.fasta -z 1 -q  > database.lc_seg

     -t # Translation table - [t]fastx35 and [t]fasty35 support
          the BLAST translation tables. See
          http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi/.
          In addition, you can score for the end of a protein
          match with '-t t', which will add "*" to the end of
          your query sequences (but your protein library
          sequences must also have '*'). Built-in protein
          matrices know about '*:*' matches; if you want to use
          '-t t' with your own matrix, you will need to include
          '*' in the matrix.

     -T # (threaded, parallel only) number of threads or  workers
          to  use  (no limit for threaded version, set at compile
          time for PVM/MPI).

     -U   Do RNA sequence comparisons: treat 'T'  as  'U',  allow
          G:U  base  pairs  (by  scoring "G-A" and "T-C" as "G-G"
          -1).  Search only one strand.

     -V "?$%*"
          Allow special annotation characters in query  sequence.
          These characters will be displayed in the alignments on
          the coordinate number line.

     -w # line width for similarity score and sequence alignment
          output.

     -W # context length (default is 1/2 of line  width  -w)  for
          programs,  like  fasta  and ssearch, that provide addi-
          tional sequence context.

     -x #match,#mismatch
          scores used for 'X:X', 'N:N', and '*:*' matches and for
          the corresponding mismatches, in place of the values
          specified in the scoring matrix. If only one value is
          given, it is used for both values.

     -X "#,#"
          offsets for the query and library sequences, used when
          numbering alignments

     -y # Width for band optimization; by default 16 for DNA and
          protein ktup=2, and 32 for protein ktup=1.

     -z # Specify statistical calculation. Default is  -z  1  for
          local   similarity   searches,  which  uses  regression
          against the length of the library sequence. -z -1  dis-
          ables  statistics (and shuffling).  -z 0 estimates sig-
          nificance without normalizing for sequence length. -z 2
          provides maximum likelihood estimates for lambda and K,
          censoring the 250 lowest and 250 highest scores.  -z  3
          uses  Altschul  and  Gish's  statistical  estimates for
          specific protein BLOSUM scoring matrices and gap penal-
          ties.  -z  4,5:  an  alternate regression method.  -z 6
          uses a composition based  maximum  likelihood  estimate
          based  on  the  method of Mott (1992) Bull. Math. Biol.
          54:59-75.  -z 11,12,14,15,16:  compute  the  regression
          against  scores  of  randomly  shuffled  copies  of the
          library sequences.  Twice as many comparisons are  per-
          formed,  but  accurate  estimates can be generated from
          databases of related sequences. -z 11  uses  the  -z  1
          regression strategy, etc.

     -Z db_size
          Set the apparent database  size  used  for  expectation
          value  calculations (used for protein/protein FASTA and
          SSEARCH, and for [T]FASTX/Y).
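
     Several of the options summarized above can be combined in a
     single search. The sketch below is one possible combination,
     not a recommended setting: it restricts the report to an
     E()-value window with -E and -F, adds coordinates and percent
     identity to the best-scores list with -m 9, and asks for four
     threads with -T. The file names, the numeric values, and the
     output name ssearch.output are assumptions made for the
     example.

```shell
#!/bin/sh
# Hypothetical inputs; substitute your own query and library.
query=query.aa
library=library.fa

# -E 1.0   show only hits with E() at or below 1.0
# -F 1e-6  hide hits with E() below 1e-6 (focus on distant relationships)
# -m 9     add alignment coordinates and percent identity to the score list
# -T 4     use four threads (threaded programs only)
opts="-q -E 1.0 -F 1e-6 -m 9 -T 4"
cmd="ssearch35_t $opts $query $library"
echo "$cmd"

# Run only when the program is installed and both inputs are readable.
if command -v ssearch35_t >/dev/null 2>&1 && [ -r "$query" ] && [ -r "$library" ]; then
    $cmd > ssearch.output
fi
```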

Reading sequences from STDIN
     The FASTA programs have been  modified  to  accept  a  query
     sequence  from  the unix "stdin" data stream.  This makes it
     much easier to use fasta35 and its relatives as  part  of  a
     WWW  page.  To indicate that stdin is to be used, use "@" as
     the query sequence file name.   "@"  can  also  be  used  to
     specify a subset of the query sequence to be used, e.g.:

     cat query.aa | fasta35 -q @:50-150 s

     would search the 's' database with residues 50-150 of
     query.aa. FASTA cannot automatically detect the sequence
     type (protein vs DNA) when "stdin" is used and assumes
     protein comparisons by default; the '-n' option is required
     for DNA queries read from stdin.
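
     For a DNA query read from stdin, the same pattern needs the
     -n option. The sketch below shows one way to write it; the
     file names query.dna and library.dna and the output name
     fasta.output are hypothetical, and the search is attempted
     only if fasta35 is installed.

```shell
#!/bin/sh
# Hypothetical DNA query and DNA library files.
query=query.dna
library=library.dna

# "@" reads the query from stdin; ":50-150" selects residues 50-150.
# -n is required because the sequence type is not detected on stdin.
cmd="fasta35 -q -n @:50-150 $library"
echo "cat $query | $cmd"

# Run only when the program is installed and both inputs are readable.
if command -v fasta35 >/dev/null 2>&1 && [ -r "$query" ] && [ -r "$library" ]; then
    cat "$query" | $cmd > fasta.output
fi
```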

Environment variables:
     FASTLIBS
          location of library choice file (-l FASTLIBS)

     SMATRIX
          default scoring matrix (-s SMATRIX)

     SRCH_URL
          the format string used to  define  the  option  to  re-
          search the database.

     REF_URL
          the format string used to define the option to look up
          the library sequence in Entrez, or some other database.
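
     The first two variables can be set once in a shell start-up
     file instead of being given on every command line. The sketch
     below assumes a library choice file at $HOME/fastlibs (a
     hypothetical path) and assumes that SMATRIX accepts the same
     matrix names as the -s option, as the "(-s SMATRIX)" note
     above suggests.

```shell
#!/bin/sh
# Hypothetical location of the library choice file (same role as -l).
FASTLIBS="$HOME/fastlibs"
# Default scoring matrix, assumed to take the same names as -s.
SMATRIX=BL62
export FASTLIBS SMATRIX

echo "FASTLIBS=$FASTLIBS SMATRIX=$SMATRIX"
```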


AUTHOR
     Bill Pearson
     wrp@virginia.EDU

     Version: $Revision: 213 $

