SSEARCH(1)                   USER COMMANDS                   SSEARCH(1)

NAME
     ssearch - scan a protein or DNA sequence library for similar
     sequences

SYNOPSIS
     ssearch [-a -b # -d # -e -l FASTLIBS -r STATFILE -m # -Q
     -s SMATRIX -w # ] query-sequence-file library-file

     ssearch [-Qabdelmorsw] query-file @library-name-file

     ssearch [-Qabdelmrsw] query-file "%PRMVI"

     ssearch [-aelmrsw] - interactive mode

DESCRIPTION
     ssearch compares a protein or DNA sequence to all of the
     entries in a sequence library using the rigorous
     Smith-Waterman algorithm (Smith and Waterman, J. Mol. Biol.
     (1981) 147:195-197). For example, ssearch can compare a
     protein sequence to all of the sequences in the NBRF PIR
     protein sequence database.

     ssearch automatically decides whether the query sequence is
     DNA or protein by reading the query sequence as protein and
     determining whether the `amino-acid composition' is more than
     85% A+C+G+T. The program can be invoked either with command
     line arguments or in interactive mode.

     ssearch compares a query sequence to a sequence library, which
     consists of sequence data interspersed with comments (see
     below).

     The fasta programs, including ssearch, use a standard text
     format sequence file. Lines beginning with '>' are treated as
     comments. Sequences may be in upper or lower case; blanks,
     tabs and unrecognizable characters are ignored. ssearch
     expects sequences to use the single letter amino acid codes,
     see protcodes(5). Library files for ssearch should have the
     form shown below.

OPTIONS
     ssearch can be directed to change the scoring matrix, search
     parameters, output format, and default search directories by
     entering options on the command line (preceded by a `-'). All
     of the options should precede the file name and ktup
     arguments. Alternatively, these options can be changed by
     setting environment variables. The options and environment
     variables are:

     -a   (SHOWALL) Modifies the display of the two sequences in
          alignments.
          Normally, both sequences are shown only where they
          overlap (SHOWALL=0). If -a is given or the environment
          variable SHOWALL = 1, both sequences are shown in their
          entirety.

     -b # The number of similarity scores to be shown when the -Q
          option is used. This value is usually calculated based
          on the actual scores.

     -d # The number of alignments to be shown. Normally, ssearch
          shows the same number of alignments as similarity
          scores. By using ssearch -Q -b 200 -d 50, one would see
          the top scoring 200 sequences and alignments for the 50
          best scores.

     -e   Scale the similarity scores by a factor of
          ln(n0)/ln(n1), where n0 and n1 are the lengths of the
          query and library sequence, respectively. This has the
          effect of increasing the scores of very short library
          sequences, such as partial N-terminal sequences, and
          decreasing the scores of very long sequences, which are
          more likely to match by random chance. Unscaled scores
          are shown with the alignments.

     -l file
          (FASTLIBS) The name of the library menu file. Normally
          this will be determined by the environment variable
          FASTLIBS. However, a library menu file can also be
          specified with -l.

     -m # (MARKX) = 0, 1, 2, 3. Alternate display of matches and
          mismatches in alignments. MARKX=0 uses ":", ".", " " for
          identities, conservative replacements, and
          non-conservative replacements, respectively. MARKX=1
          uses " ", "x", and "X". MARKX=2 does not show the second
          sequence, but uses the second alignment line to display
          matches with a "." for identity, or with the mismatched
          residue for mismatches. MARKX=2 is useful for aligning
          large numbers of similar sequences. MARKX=3 writes out a
          file of library sequences in FASTA format. MARKX=3
          should always be used with the "SHOWALL" (-a) option,
          but this does not completely ensure that all of the
          sequences output will be aligned.

     -Q   Quiet option. This allows ssearch to search a database
          and report the results without asking any questions.
          ssearch -Q file library > output can be put in the
          background or run at a later time with the unix 'at'
          command. The number of similarity scores and alignments
          displayed with the -Q option can be modified with the -b
          (scores) and -d (alignments) options.

     -r STATFILE
          Causes ssearch to write out the sequence identifier,
          superfamily number (if available), and similarity scores
          to STATFILE for every sequence in the library. These
          results are not sorted.

     -s str
          (SMATRIX) The filename of an alternative scoring matrix
          file. For protein sequences, PAM250 is used by default;
          PAM120 can be used with the command line option -s 120.

     -w # (LINLEN) Output line length for sequence alignments
          (normally 60, can be set up to 200).

EXAMPLES
     (1)  ssearch musplfm.aa $AABANK

     Compare the amino acid sequence in the file musplfm.aa with
     the complete PIR protein sequence library. This is extremely
     slow and should almost never be done. ssearch is designed to
     search very small libraries of sequences.

     (2)  ssearch -a -w 80 musplfm.aa lcbo.aa

     Compare the amino acid sequence in the file musplfm.aa with
     the sequences in the file lcbo.aa using ktup = 1. Show both
     sequences in their entirety, with 80 residues on each output
     line.

     (3)  ssearch

     Run the ssearch program in interactive mode. The program will
     prompt for the file name for the query sequence, list
     alternative libraries to be searched (if FASTLIBS is set), and
     prompt for the ktup.

     You can use your own sequence files for ssearch; just be
     certain to put a '>' and comment as the first line before each
     sequence, as in this library file:

          >LCBO bovine preprolactin
          WILLLSQ ...
          >LCHU human ...
          ...

SEE ALSO
     rss(1), align(1), fasta(1), rdf2(1), protcodes(5), dnacodes(5)

AUTHOR
     Bill Pearson
     wrp@virginia.EDU

Sun Release 4.1            Last change: local
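
The two numeric rules described above, the 85% A+C+G+T test used to
tell DNA from protein and the ln(n0)/ln(n1) scaling applied by -e,
can be illustrated with a short Python sketch. This is not taken from
the ssearch source: the function names, the upper-casing of input,
the skipping of non-letter characters, and the error handling are
all assumptions made for illustration.

```python
import math


def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G or T.

    Hypothetical helper: folding case and skipping non-letters are
    assumptions, not documented ssearch behavior.
    """
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False  # nothing to classify
    acgt = sum(1 for ch in residues if ch in "ACGT")
    return acgt / len(residues) > threshold


def length_scaled_score(score, n0, n1):
    """Scale a similarity score by ln(n0)/ln(n1), as described for -e.

    n0 is the query length and n1 the library sequence length.
    ln(1) is 0, so a library length of 1 is rejected here to avoid
    dividing by zero (an assumption of this sketch).
    """
    if n0 < 1 or n1 < 2:
        raise ValueError("need n0 >= 1 and n1 >= 2")
    return score * math.log(n0) / math.log(n1)


if __name__ == "__main__":
    print(looks_like_dna("ACGTTGCAACGT"))
    print(looks_like_dna("MKVLAAGIVG"))
    # A library sequence shorter than the query is scaled up,
    # a longer one is scaled down.
    print(length_scaled_score(100, 200, 50))
    print(length_scaled_score(100, 200, 1000))
```

In this sketch a library sequence shorter than the query receives a
factor above 1 and a longer one a factor below 1, which matches the
effect described for the -e option.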