NAME
     pima - Pattern-Induced Multi-sequence Alignment program

SYNOPSIS
     pima [options] cluster_name seq_filename
          [ref_seq_name sec_struct_seq_filename]

EXAMPLES
     pima SAMPLE sample-family.fa
     pima SAMPLE-STRUCT sample-struct.fa 1ldm pdb-dssp.ss

DESCRIPTION
     pima performs a multi-sequence alignment of a set of (presumably
     related) sequences using an extension of our covering pattern
     construction algorithm (Smith and Smith, 1990, 1992). All
     pairwise comparisons between sequences in the set are performed,
     and the resulting scores are clustered into one or more families
     using two different linkage rules: 1) maximal linkage (Smith and
     Smith, 1990) and 2) sequential branching (Smith and Smith, 1992).
     For the latter, all pairwise scores are sorted high-to-low, the
     first sequence from the highest-scoring pair is chosen as the
     "reference sequence", and the sequences are clustered based
     strictly on the order of similarity to the reference sequence.
     Each cluster is then multiply-aligned using a pattern-based
     alignment algorithm (Smith and Smith, 1992). Patterns are
     constructed using the alphabet shown below.

     If secondary structure sequences are provided for one or more of
     the primary sequences (one of which must be designated as a
     "reference sequence"), then the sequences are clustered using the
     sequential branching rule and the set is multiply-aligned using a
     secondary structure-dependent gap penalty algorithm (Smith and
     Smith, 1992).

     Pattern Alphabet: the pattern alphabet includes the standard
     single-letter IUPAC codes for the 20 amino acids plus additional
     characters for 63 combinations of amino acids. These combinations
     are the most informative (i.e., the most abundant relative to
     random expectation) observed in our database of aligned sequence
     families.

          J IV    i ILV   q NS    0 NK    9 AL    + KT    ] AGS
          U RK    j LF    r AP    1 PT    ! NT    , GP    ^ GT
          a DE    B ND    s EK    2 NG    # ES    / KS    _ NE
          b IL    k LM    t QK    3 QH    $ IT    : LT    { IF
          c FY    m GS    u RQ    4 LS    % DS    ; NH    | SV
          d ST    n AV    v DG    5 TV    & RS    < QS    } RP
          e AS    Z QE    w LP    6 HY    ( QP    ? QL    ~ RH
          f LV    o AT    y EG    7 IM    ) AST   @ MV    . GK
          h AG    p PS    z RG    8 AE    * ILM   [ EP    X (wildcard)

     Gaps are denoted by "g".

PARAMETERS
     cluster_name
          An arbitrary name used to label the cluster.

     seq_filename
          Name of the input file containing the sequences to be
          clustered and multi-aligned. Sequences can be in any of the
          following formats: IG/Stanford, GenBank/GB, NBRF, EMBL,
          Pearson/Fasta, PIR/CODATA, Table (LOCUS_NAME SEQUENCE [one
          seq/line]). Locus names cannot contain left or right
          parentheses. The format of the output sequence files will
          match the format of this input file.

     ref_seq_name
          [optional; if specified, then sec_struct_seq_filename must
          also be specified] Locus name of one of the primary
          sequences for which the secondary structure is in the file
          sec_struct_seq_filename.

     sec_struct_seq_filename
          [optional; if specified, then ref_seq_name must also be
          specified] Name of a file containing secondary structure
          sequences for one or more of the primary sequences in the
          set. The secondary structure sequences in this file must be
          in one of the formats listed above (see seq_filename). The
          locus name of each sequence must be the locus name of its
          corresponding primary sequence with the suffix '.ss' (e.g.
          1ldm.ss). An alpha-helix, 3-10 helix, and beta-strand must
          be designated 'h', 'g', and 'e', respectively. All other
          characters in the secondary structure sequences will be
          ignored with respect to the structure-dependent gap penalty.
          To allow gaps to be placed between the first and second
          elements, and between the last two elements, of these
          structures, the first and last 2 elements of each should be
          changed to another character.
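The end-capping step described above can be sketched in a few lines of Python. This is an illustrative helper, not part of the pima package: the function name cap_structure_ends and its parameters are hypothetical, it assumes "first and last 2 elements" means two positions at each end of every run, and it uses 'i', 'f', and 'd' as replacement characters (the designations used in the pdb-dssp.ss file supplied with the package).

```python
from itertools import groupby

# Replacement characters for run ends, following pdb-dssp.ss:
# alpha-helix 'h' -> 'i', 3-10 helix 'g' -> 'f', beta-strand 'e' -> 'd'.
CAP_CHARS = {"h": "i", "g": "f", "e": "d"}


def cap_structure_ends(ss, cap_len=2, caps=CAP_CHARS):
    """Return ss with the first and last cap_len positions of every
    run of a structure character replaced by its cap character.

    Assumption: cap_len=2 at both ends; runs shorter than 2 * cap_len
    are capped entirely. Characters not in caps are left unchanged.
    """
    pieces = []
    for char, run in groupby(ss):
        n = len(list(run))
        if char not in caps:
            pieces.append(char * n)
            continue
        pieces.append("".join(
            caps[char] if (i < cap_len or i >= n - cap_len) else char
            for i in range(n)
        ))
    return "".join(pieces)


if __name__ == "__main__":
    print(cap_structure_ends("--hhhhhh--eee-"))
```

Because the capped positions no longer carry one of the secondary structure characters, they fall outside the structure-dependent gap penalty, while the interior of each helix or strand remains protected.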
          In the secondary structure sequence file pdb-dssp.ss
          provided with this package, these end cap elements are
          designated 'i', 'f', and 'd' for alpha-helices, 3-10
          helices, and beta-strands, respectively.

OPTIONS
     -c number
          Use a cluster score cutoff of number. This is the lowest
          match score to be used to incorporate a sequence into a
          cluster. The default value of 0.0 will force all input
          sequences into 1 cluster, but the final pattern may be
          completely degenerate.

     -d number
          Use a length dependent gap penalty of number. This is the
          cost of extending a gap. The default value is dependent on
          the matrix file used.

     -h
          Print a short help message and quit.

     -i number
          Use a length independent gap penalty of number. This is the
          cost of opening a gap. The default value is dependent on the
          matrix file used.

     -l number
          Use a minimum local score of number. This is the lowest
          score a quadrant can have before an attempt is made to join
          this local alignment with the local alignment at the
          previous step. The default value is dependent on the matrix
          file used.

     -m file
          Use the matrix file named file. The default matrix file is
          patgen.mat, which is provided with this package. The matrix
          file class1.mat uses the original pima alphabet. The matrix
          file class2.mat, also provided, is similar to class1.mat but
          uses the new alphabet.

     -n
          Do not use numerical extensions on each step of the
          alignment.

     -t number
          Use a secondary structure gap penalty of number. This is the
          cost of a gap at a position matching a secondary structure
          character. The default value is dependent on the matrix file
          used and is always 10 times the value of the length
          independent gap penalty of the matrix file.

     -u characters
          Use characters as the list of secondary structure characters
          instead of the default characters of hge.

     -w number
          Use a minimum local alignment width of number instead of the
          default 15.
          A quadrant with a width less than this value is ignored, and
          no attempt is made to join this local alignment with the
          local alignment at the previous step.

     -M
          Only perform maximal linkage. This option will also drop the
          -ML from the output file names.

     To see the default values for a given matrix, run the program
     pima-pm and enter the name of the matrix for which you want to
     see the default values. Press return until you see the default
     value of the parameter you are interested in, and then interrupt
     (control-C) the program.

OUTPUT FILES CREATED
     cluster_name[-ML|-SB][.ext].cluster
          The cluster tree(s) created by the clustering algorithm(s):
          maximal linkage clusters are labeled with '-ML' appended to
          the cluster_name; sequential branching clusters are labeled
          '-SB'. If more than one cluster is generated from the input
          sequence set, each cluster is given an extension
          (cluster_name-ML.1, cluster_name-ML.2, etc.). Each cluster
          in a cluster file is represented as a nested list with
          sequence names separated by a match score, e.g.:

               CLUSTER_NAME-ML ((A 200.0 B) 150.0 C)

          File format:
               cluster_name-[ML|SB][.ext] cluster_nested_list

     cluster_name[-ML|-SB][.ext].pattern
          The "root" AACC pattern constructed from each cluster.

          File format:
               cluster_name-[ML|SB][.ext] AACC_sequence

     cluster_name[-ML|-SB][.ext].pima
          The pattern-induced multiple-sequence alignment of each
          clustered sequence set, including the "nodal" patterns used
          to align the sequences. The nodal patterns have the locus
          name cluster_name-[ML|SB].ext; extensions added to the
          sequence names match the extension of the nodal pattern used
          to align the corresponding sequence subset (e.g., seq_1-ML.1
          and seq_2-ML.1 would be aligned by nodal pattern
          cluster_name-ML.1).

          File format: the same format as the input sequence file,
          seq_filename.
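The nested-list representation used in .cluster files can be read back with a small recursive parser. The Python sketch below is illustrative only and is not one of the programs shipped with pima; the function names are hypothetical. It assumes that every internal node has the form (left score right), that names contain no whitespace, and that the cluster name is everything before the first left parenthesis. The last assumption relies on the rule that locus names cannot contain parentheses.

```python
import re


def tokenize(text):
    """Split a nested list into '(' and ')' tokens plus bare words."""
    return re.findall(r"[()]|[^\s()]+", text)


def parse_node(tokens, pos):
    """Parse a leaf name or a (left score right) node starting at pos.

    Returns (tree, next_pos). Leaves are strings; internal nodes are
    (left, score, right) tuples with score as a float.
    """
    if tokens[pos] != "(":
        return tokens[pos], pos + 1
    left, pos = parse_node(tokens, pos + 1)
    score = float(tokens[pos])
    right, pos = parse_node(tokens, pos + 1)
    if tokens[pos] != ")":
        raise ValueError("expected ')' at token %d" % pos)
    return (left, score, right), pos + 1


def parse_cluster_line(line):
    """Return (cluster_name, tree) for one line of a .cluster file."""
    name, paren, rest = line.partition("(")
    if not paren:
        raise ValueError("no nested list found in line")
    tokens = tokenize("(" + rest)
    tree, pos = parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return name.strip(), tree


def leaves(tree):
    """List the sequence names in a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, _score, right = tree
    return leaves(left) + leaves(right)


if __name__ == "__main__":
    name, tree = parse_cluster_line("CLUSTER_NAME-ML ((A 200.0 B) 150.0 C)")
    print(name, tree, leaves(tree))
```

Splitting the name at the first left parenthesis means the parser accepts the line whether or not whitespace separates the cluster name from the nested list.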
REQUIRED AUXILIARY PROGRAMS/SCRIPTS/FILES
     Programs: cluster-pima, pima-mso, pima-pm, extract-cluster-loci,
          extract-records, extract-root-pat, print-cluster,
          trim-root-num, print-pima, make-cluster, make-pattern

     Files: class1.mat, class2.mat, patgen.mat

NOTES
     Only minimal sequence information is maintained by the sequence
     input and output routines. Additionally, not every aspect of the
     various sequence file formats is handled correctly. If in doubt,
     please use sequence files that are in Fasta or table format.

REFERENCES
     Smith, Randall F. and Smith, Temple F. (1990). Automatic
     generation of primary sequence patterns from sets of related
     protein sequences. PNAS 87:118-122.

     Smith, Randall F. and Smith, Temple F. (1992). Pattern-Induced
     Multi-sequence Alignment (PIMA) algorithm employing secondary
     structure-dependent gap penalties for comparative protein
     modelling. Protein Engineering 5:35-41.

     Randall F. Smith
     Human Genome Center, Dept. of Molecular and Human Genetics,
     Baylor College of Medicine, Houston, TX 77096
     rsmith@bcm.tmc.edu

     Temple F. Smith
     Molecular Bio-Engineering Research Center, Boston Univ.,
     36 Cummington St, Boston, MA 02115
     tsmith@darwin.bu.edu

     Copyright (c) 1990, 1991, 1992, MBCRR, Dana-Farber Cancer
     Institute and Harvard University.
     Copyright (c) 1993, 1994, Baylor College of Medicine.