Misc. Reference Manual Pages FASTA/SSEARCH/[T]FASTX/Y/LALIGN 1(local)

NAME
     fasta35, fasta35_t - scan a protein or DNA sequence library for
     similar sequences

     fastx35, fastx35_t - compare a DNA sequence to a protein sequence
     database, comparing the translated DNA sequence in forward and
     reverse frames.

     tfastx35, tfastx35_t - compare a protein sequence to a DNA
     sequence database, calculating similarities with frameshifts to
     the forward and reverse orientations.

     fasty35, fasty35_t - compare a DNA sequence to a protein sequence
     database, comparing the translated DNA sequence in forward and
     reverse frames.

     tfasty35, tfasty35_t - compare a protein sequence to a DNA
     sequence database, calculating similarities with frameshifts to
     the forward and reverse orientations.

     fasts35, fasts35_t - compare unordered peptides to a protein
     sequence database

     fastm35, fastm35_t - compare ordered peptides (or short DNA
     sequences) to a protein (DNA) sequence database

     tfasts35, tfasts35_t - compare unordered peptides to a translated
     DNA sequence database

     fastf35, fastf35_t - compare mixed peptides to a protein sequence
     database

     tfastf35, tfastf35_t - compare mixed peptides to a translated DNA
     sequence database

     ssearch35, ssearch35_t - compare a protein or DNA sequence to a
     sequence database using the Smith-Waterman algorithm.

     ggsearch35, ggsearch35_t - compare a protein or DNA sequence to a
     sequence database using a global alignment (Needleman-Wunsch)

     glsearch35, glsearch35_t - compare a protein or DNA sequence to a
     sequence database with alignments that are global in the query
     and local in the database sequence (global-local).

     lalign35 - produce multiple non-overlapping alignments for
     protein and DNA sequences using the Huang and Miller sim
     implementation of the Waterman-Eggert algorithm.

     prss35, prfx35 - discontinued; all the FASTA programs will
     estimate statistical significance using 500 shuffled sequence
     scores if two sequences are compared.

DESCRIPTION
     Release 3.5 of the FASTA package provides a modular set of
     sequence comparison programs that can run on conventional single
     processor computers or in parallel on multiprocessor computers.
     More than a dozen programs - fasta35, fastx35/tfastx35,
     fasty35/tfasty35, fasts35/tfasts35, fastm35, fastf35/tfastf35,
     ssearch35, ggsearch35, and glsearch35 - are currently available.

     All of the comparison programs share a set of basic command line
     options; additional options are available for individual
     comparison functions.

     Threaded versions of the FASTA programs (fasta35_t, ssearch35_t,
     etc.) will run in parallel on modern Linux and Unix multi-core or
     multi-processor computers. Accelerated versions of the
     Smith-Waterman algorithm are available for the Intel SSE2 and
     Altivec PowerPC architectures; they can speed up Smith-Waterman
     calculations 10- to 20-fold.

     In addition to the serial and threaded versions of the FASTA
     programs, PVM and MPI parallel versions are available as
     pv35compfa, mp35compfa, pv35compsw, mp35compsw, etc. For more
     information, see pvcomp.1, readme.pvm_mpi. The PVM/MPI program
     versions use the same command line options as the serial and
     threaded FASTA program versions.

  Running the FASTA programs
     Although the FASTA programs can be run interactively, prompting
     for a query file and a library, it is usually more convenient to
     run them from the Unix or MacOSX terminal, or from the Windows
     shell command line. Thus,

          fasta35_t -q -option1 -option2 -option3 query.file library.file > fasta.output

     runs the threaded version of the fasta35 program, without asking
     for any input (-q), setting various parameter and output options,
     and comparing the sequences in query.file to the sequences in
     library.file.
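     The command line layout described above can be sketched as a
     small dry-run helper. This is only an illustration: fasta_cmd is
     a hypothetical function (not part of the FASTA package), and
     query.aa and library.fa are placeholder file names. It prints the
     command line instead of running it, with -q first and every other
     option placed before the query and library names.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a FASTA command line.
# fasta_cmd is a hypothetical helper, not part of the FASTA package.
fasta_cmd() {
    prog=$1; query=$2; library=$3
    shift 3
    # -q suppresses the interactive prompts; the remaining options
    # are written before the query and library file names.
    printf '%s' "$prog -q"
    for opt in "$@"; do
        printf ' %s' "$opt"
    done
    printf ' %s %s\n' "$query" "$library"
}

# query.aa and library.fa are placeholder names.
fasta_cmd fasta35_t query.aa library.fa -E 0.001 -m 9
# prints: fasta35_t -q -E 0.001 -m 9 query.aa library.fa
```

     The helper only prints the line; to run a real search, the same
     arguments would be given to the program directly and the output
     redirected with ">" as in the example above.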

     Optional arguments to the FASTA programs must precede the
     query.file, library.file, and optional ktup arguments. The FASTA
     programs provide an option (-O) for sending output to a file, but
     generally it is better to simply redirect output with the ">"
     shell symbol.

  FASTA program options
     The default scoring matrix and gap penalties used by each of the
     programs have been selected for high sensitivity searches with
     the various algorithms. The default program behavior can be
     modified by providing command line options before the query.file
     and library.file arguments. Command line options can also be used
     in interactive mode.

     Command line arguments come in several classes.

     (1)  Commands that specify the comparison type. FASTA, FASTS,
          FASTM, SSEARCH, GGSEARCH, and GLSEARCH can compare either
          protein or DNA sequences, and attempt to recognize the
          comparison type by looking at the residue composition. -n,
          -p specify DNA (nucleotide) or protein comparison,
          respectively. -U specifies RNA comparison.

     (2)  Commands that limit the set of sequences compared: -1, -3,
          -M.

     (3)  Commands that modify the scoring parameters: -f gap-open
          penalty, -g gap-extend penalty, -h inter-codon frame-shift,
          -j within-codon frame-shift, -s scoring-matrix, -r
          match/mismatch score, -x X:X score.

     (4)  Commands that modify the algorithm (mostly FASTA and
          [T]FASTX/Y): -c, -w, -y, -o. The -S option can be used to
          ignore lower-case (low complexity) residues during the
          initial score calculation.

     (5)  Commands that modify the output: -A, -b number, -C width, -d
          number, -L, -m 0-11, -w line-width, -W context-width, -X
          offset1,offset2

     (6)  Commands that affect statistical estimates: -Z, -k.

  Option summary:
     -1   Sort by "init1" score (obsolete)

     -3   (TFASTX/Y35 only) use only forward frame translations

     -a # "SHOWALL" option attempts to align all of both sequences in
          FASTA and SSEARCH.

     -A   (FASTA35 DNA comparison only) force Smith-Waterman alignment
          for output. Smith-Waterman is the default for FASTA protein
          alignment and [T]FASTX/Y, but not for DNA comparisons with
          FASTA.

     -b # number of best scores to show (must be < expectation cutoff
          if -E is given). By default, this option is no longer used;
          all scores better than the expectation (E()) cutoff are
          listed.

     -B   show z-scores rather than bit scores (for compatibility with
          much older versions).

     -c # threshold for band optimization (FASTA, [T]FASTX/Y)

     -C # length of name abbreviation in alignments, default = 6.
          Must be less than 20.

     -d # number of best alignments to show (must be < expectation
          (-E) cutoff)

     -D   turn on debugging mode. Enables checks on sequence alphabet
          that cause problems with tfastx35, tfasty35 (only available
          after compile time option).

     -E # expectation value upper limit for score and alignment
          display. Defaults are 10.0 for FASTA35 and SSEARCH35 protein
          searches, 5.0 for translated DNA/protein comparisons, and
          2.0 for DNA/DNA searches.

     -f # penalty for opening a gap.

     -F # expectation value lower limit for score and alignment
          display. -F 1e-6 prevents library sequences with E()-values
          lower than 1e-6 from being displayed. Use to shift focus to
          more distant relationships.

     -g # penalty for additional residues in a gap

     -h # ([T]FASTX/Y only) penalty for a frameshift between two
          codons.

     -j # ([T]FASTY only) penalty for a frameshift within a codon.

     -H   turn off histogram display. (The meaning of -H is reversed
          with the PVM/MPI parallel versions, where the histogram
          display is off by default).

     -i   (FASTA DNA, [T]FASTX/Y) compare against only the reverse
          complement of the library sequence.

     -k   specify number of shuffles for statistical parameter
          estimation (default=500). Shuffles are done whenever the
          database size is smaller than this value; in particular, 500
          shuffles are done when only two sequences are aligned. To
          disable shuffling, use -z -1 (no statistical estimates) or
          -z 3 (Altschul-Gish statistics).

     -l str
          specify FASTLIBS file

     -L   report long sequence description in alignments (up to 200
          characters).

     -m 0,1,2,3,4,5,6,9,10,11
          alignment display options. -m 0, 1, 2, 3 display different
          types of alignments. -m 4 provides an alignment "map" on the
          query. -m 5 combines the alignment map and a -m 0 alignment.
          -m 6 provides an HTML output. -m 9 does not change the
          alignment output, but provides alignment coordinate and
          percent identity information with the best scores report.
          -m 9c adds encoded alignment information to the -m 9 output;
          -m 9i provides only percent identity and alignment length
          information with the best scores. With current versions of
          the FASTA programs, independent -m options can be combined;
          e.g. -m 1 -m 9c -m 6. -m 11 provides lav format output from
          lalign35. It does not currently affect other alignment
          algorithms. The lav2ps and lav2svg programs can be used to
          convert lav format output to postscript/SVG alignment
          "dot-plots".

     -M #-#
          molecular weight (residue) cutoffs. -M "101-200" examines
          only sequences that are 101-200 residues long.

     -n   force query to nucleotide sequence

     -N # break long library sequences into blocks of # residues.
          Useful for bacterial genomes, which have only one sequence
          entry. -N 2000 works well for bacterial genomes.

     -o   (FASTA) turn fasta band optimization off during initial
          phase. This was the behavior of fasta1.x versions
          (obsolete).

     -O file
          send output to file.

     -p   Force query sequence type to protein.

     -P "file type"
          specify a PSI-BLAST PSSM file of type "type". Available
          types are:

          0 - ascii PSSM file, produced by blastpgp -Q file.pssm

          1 - binary (architecture dependent) PSSM file, produced by
              blastpgp -C file.pssm -u 0

          2 - binary ASN.1 (architecture independent) PSSM file,
              produced by blastpgp -C file.pssm -u 2

     -q/-Q
          quiet option; do not prompt for input

     -r "+n/-m"
          (DNA only) values for match/mismatch for DNA comparisons.
          +n is used for the maximum positive value and -m is used for
          the maximum negative value. Values between max and min are
          rescaled, but residue pairs having the value -1 continue to
          be -1.

     -R file
          save all scores to statistics file (previously -r file)

     -s name
          specify substitution matrix. BLOSUM50 is used by default;
          PAM120, PAM250, and BLOSUM62 can be specified by setting
          -s P120, -s P250, or -s BL62. With this version, many more
          scoring matrices are available, including BLOSUM80 (BL80),
          and MDM10, MDM20, MDM40 (Jones, Taylor, and Thornton, 1992
          CABIOS 8:275-282; specified as -s M10, -s M20, -s M40).
          Alternatively, BLASTP1.4 format scoring matrix files can be
          specified. BL80, BL62, and P120 are scaled in 1/2 bit units;
          all the other matrices use 1/3 bit units. DNA scoring
          matrices can also be specified with the "-r" option.

     -S   treat lower case letters in the query or database as low
          complexity regions that are equivalent to 'X' during the
          initial database scan, but are treated as normal residues
          for the final alignment display. Statistical estimates are
          based on the 'X'ed out sequence used during the initial
          search.

          Protein databases (and query sequences) can be generated in
          the appropriate format using John Wootton's "pseg" program,
          available from ftp://ncbi.nlm.nih.gov/pub/seg/pseg. Once you
          have compiled the "pseg" program, use the command:

               pseg database.fasta -z 1 -q > database.lc_seg

     -t # Translation table - [t]fastx35 and [t]fasty35 support the
          BLAST translation tables. See
          http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi/.

          In addition, you can score for the end of a protein match
          with '-t t', which will add "*" to the end of your query
          sequences (but your protein library sequences must also have
          '*'). Built in protein matrices know about '*:*' matches; if
          you want to use '-t t' with your own matrix, you will need
          to include '*' in the matrix.

     -T # (threaded, parallel only) number of threads or workers to
          use (no limit for threaded version, set at compile time for
          PVM/MPI).

     -U   Do RNA sequence comparisons: treat 'T' as 'U', allow G:U
          base pairs (by scoring "G-A" and "T-C" as "G-G" -1). Search
          only one strand.

     -V "?$%*"
          Allow special annotation characters in query sequence. These
          characters will be displayed in the alignments on the
          coordinate number line.

     -w # line width for similarity score and sequence alignment
          output.

     -W # context length (default is 1/2 of line width -w) for
          programs, like fasta and ssearch, that provide additional
          sequence context.

     -x #match,#mismatch
          scores used for 'X:X', 'N:N', and '*:*' matches and for the
          corresponding mismatches, in place of the values specified
          in the scoring matrix. If only one value is given, it is
          used for both values.

     -X "#,#"
          offsets query, library sequence for numbering alignments

     -y # Width for band optimization; by default 16 for DNA and
          protein ktup=2; 32 for protein ktup=1.

     -z # Specify statistical calculation. Default is -z 1 for local
          similarity searches, which uses regression against the
          length of the library sequence.

          -z -1  disables statistics (and shuffling).

          -z 0   estimates significance without normalizing for
                 sequence length.

          -z 2   provides maximum likelihood estimates for lambda and
                 K, censoring the 250 lowest and 250 highest scores.

          -z 3   uses Altschul and Gish's statistical estimates for
                 specific protein BLOSUM scoring matrices and gap
                 penalties.

          -z 4,5 an alternate regression method.

          -z 6   uses a composition based maximum likelihood estimate
                 based on the method of Mott (1992) Bull. Math. Biol.
                 54:59-75.

          -z 11,12,14,15,16
                 compute the regression against scores of randomly
                 shuffled copies of the library sequences. Twice as
                 many comparisons are performed, but accurate
                 estimates can be generated from databases of related
                 sequences. -z 11 uses the -z 1 regression strategy,
                 etc.

     -Z db_size
          Set the apparent database size used for expectation value
          calculations (used for protein/protein FASTA and SSEARCH,
          and for [T]FASTX/Y).

  Reading sequences from STDIN
     The FASTA programs have been modified to accept a query sequence
     from the unix "stdin" data stream. This makes it much easier to
     use fasta35 and its relatives as part of a WWW page. To indicate
     that stdin is to be used, use "@" as the query sequence file
     name. "@" can also be used to specify a subset of the query
     sequence to be used, e.g.:

          cat query.aa | fasta35 -q @:50-150 s

     would search the 's' database with residues 50-150 of query.aa.
     FASTA cannot automatically detect the sequence type (protein vs
     DNA) when "stdin" is used and assumes protein comparisons by
     default; the '-n' option is required for DNA queries read from
     stdin.

  Environment variables:
     FASTLIBS
          location of library choice file (-l FASTLIBS)

     SMATRIX
          default scoring matrix (-s SMATRIX)

     SRCH_URL
          the format string used to define the option to re-search the
          database.

     REF_URL
          the format string used to define the option to look up the
          library sequence in entrez, or some other database.

AUTHOR
     Bill Pearson
     wrp@virginia.EDU

     Version: $ Id: $ Revision: $Revision: 213 $

SunOS 5.10
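     The "@" convention described under "Reading sequences from STDIN"
     can be sketched as follows. This is a minimal illustration, not
     part of the FASTA package: stdin_query is a hypothetical helper
     that only builds the query argument, and the pipeline is printed
     rather than executed. The names query.aa and s are the ones used
     in the stdin example in this manual.

```shell
#!/bin/sh
# stdin_query is a hypothetical helper: it prints the "@" query name,
# with an optional :start-end residue subset when two arguments
# (start and end) are given.
stdin_query() {
    if [ $# -eq 2 ]; then
        printf '@:%s-%s\n' "$1" "$2"
    else
        printf '@\n'
    fi
}

# Print (rather than run) the pipeline from the manual's example.
printf 'cat query.aa | fasta35 -q %s s\n' "$(stdin_query 50 150)"
# prints: cat query.aa | fasta35 -q @:50-150 s
```

     For a DNA query read from stdin, -n would be added before the "@"
     argument, since the sequence type is not detected automatically
     in that case.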