VISA: An interactive program for the visual analysis of similarities in multiple amino acid sequences.

CONCEPT

VISA identifies amino acid patterns that are common to many members of a set of amino acid sequences, and displays the distribution of common patterns along the sequences in a series of histograms. Individual peaks of these histograms can be assigned different colors. Common sequence patterns inherit and display the color of the peak in which they occur, so that analogous segments in the other sequences are marked in matching colors. These peaks usually correspond to the conserved sequence motifs that are characteristic of the studied proteins. The resulting color overview of sequence similarities can help in understanding the architecture of the protein family and in designing experiments to probe function.

METHOD

When sequences of a set of related proteins are loaded into VISA, the program locates amino acid triplets (three specific residues, separated by two short but fixed-length runs of nonspecific residues) that are common to a preset fraction of the sequences. A table on the screen shows how many common triplets can be found with different triplet size limits and with different commonality indices. A left and then a right mouse click over an item in this matrix sets the density of common triplets for the subsequent analysis.

The distribution of common elements along the sequences is displayed on the main canvas in a set of histograms, one histogram per sequence. Horizontal axes represent sequences; bar heights are proportional to the number of common patterns that match the sequence at given positions.

With a click of the mouse the user selects one of the available colors, and with additional clicks the user puts brackets around a peak on one of the histograms. Common triplets in the sequence segment corresponding to this bracketed peak are then automatically assigned the selected color.
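The steps described so far (locating common triplets, counting them per span and commonality index, building the histograms, and picking the triplets inside a bracket) can be illustrated with a minimal Python sketch. This is not VISA's own code: the function names, the tuple representation of a triplet, and the brute-force enumeration are assumptions made for illustration. The rule that each matching triplet adds to three bars, one per specific residue, is taken from the REMARKS section.

```python
from collections import defaultdict


def find_triplets(seq, max_span):
    """Yield (pattern, start) for every triplet in seq whose span <= max_span.

    A pattern is (a, gap1, b, gap2, c): three specific residues separated by
    gap1 and gap2 nonspecific residues, so its span is 3 + gap1 + gap2.
    """
    n = len(seq)
    for i in range(n):
        for j in range(i + 1, min(i + max_span - 1, n)):
            for k in range(j + 1, min(i + max_span, n)):
                yield (seq[i], j - i - 1, seq[j], k - j - 1, seq[k]), i


def common_triplets(seqs, max_span, min_seqs):
    """Return {pattern: {seq_index: [start, ...]}} for patterns that occur
    in at least min_seqs of the sequences (hypothetical data layout)."""
    hits = defaultdict(lambda: defaultdict(list))
    for s, seq in enumerate(seqs):
        for pattern, start in find_triplets(seq, max_span):
            hits[pattern][s].append(start)
    return {p: dict(occ) for p, occ in hits.items() if len(occ) >= min_seqs}


def span_index_matrix(seqs, spans, indices):
    """Count common triplets for every (span limit, commonality index) pair."""
    return {(span, idx): len(common_triplets(seqs, span, idx))
            for span in spans for idx in indices}


def histogram(seq_index, seq_len, common):
    """Bar heights for one sequence: each matching triplet adds one to the
    bar at each of its three specific residues."""
    bars = [0] * seq_len
    for (a, g1, b, g2, c), occ in common.items():
        for start in occ.get(seq_index, []):
            for pos in (start, start + g1 + 1, start + g1 + g2 + 2):
                bars[pos] += 1
    return bars


def triplets_in_bracket(seq_index, left, right, common):
    """Patterns with an occurrence lying entirely inside [left, right]
    (0-based, inclusive) of the bracketed sequence."""
    chosen = set()
    for pattern, occ in common.items():
        span = 3 + pattern[1] + pattern[3]
        if any(left <= start and start + span - 1 <= right
               for start in occ.get(seq_index, [])):
            chosen.add(pattern)
    return chosen
```

The enumeration is cubic in the span limit and is meant only to make the definitions concrete; a real implementation would presumably index the patterns more efficiently.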
Bars of the histograms will be repainted, and contributions of colored common triplets will be indicated by partially colored bars. Assignment of further colors to previously unpainted peaks (and to common triplets) can be continued as long as more colors are available and more significant uncolored peaks can be found.

To help the localization and display of global similarities, the colored image can be further manipulated (e.g. rescaled, background adjusted, aligned). Management tools are also at the user's disposal: loading sequences and patterns from files, presenting sequence annotation data, printing alignment information, displaying conserved blocks in one-letter amino acid code, showing the common triplet sets of conserved blocks, etc.

USE

Actions of the program are invoked mostly by clicking with the mouse over software control items like buttons, checkboxes or menus. These control items are either on the top control panel of the main window, or in one of the subwindows. The subwindows are activated by buttons that reside in the main control panel. Standard XView manipulations (resizing, refreshing, hiding, scrolling, ...) are effective for the windows that are controlled from VISA.

The set files button activates a window where file names for sequence, common pattern and output data have to be specified. The create and load checkboxes control whether a new pattern file is to be created or an existing pattern file is to be used. Press done when the file names and checkboxes are correctly set. Pressing the names button will bring up a window that displays some information (IDs, sequence sizes, accession numbers and definition lines) about the analyzed sequences.

Some or all of the file names for input and output data can be supplied on the command line; the program will load these names into the appropriate window items. Any order of names is accepted. The flags -s, -p and -o precede the names (no blanks between flag and name) and specify their interpretation.
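The command-line convention just described (a flag glued directly to its file name, in any order) can be sketched in Python as follows. This is only an illustration of the convention, not VISA's actual argument handling. The mapping of -s, -p and -o to sequence, pattern and output files is inferred from the flag letters, and the function name and the file names used in the usage note are hypothetical.

```python
def parse_visa_args(argv):
    """Map arguments such as '-sNAME' to file roles.

    Assumption: -s = sequence file, -p = common pattern file,
    -o = output file. The flag and the name form a single token,
    with no blank between them.
    """
    roles = {"-s": "sequence", "-p": "pattern", "-o": "output"}
    files = {}
    for arg in argv:
        flag, name = arg[:2], arg[2:]
        if flag not in roles or not name:
            raise ValueError(f"unrecognized argument: {arg!r}")
        files[roles[flag]] = name
    return files
```

For example, `parse_visa_args(["-oxyl.out", "-sxyl.seq"])` yields a dictionary with an output and a sequence entry and leaves the pattern file unset, mirroring the statement that some or all of the names may be given.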
A new subwindow with a matrix of integers appears either when new sequence and common pattern files are loaded or when the span/index button is clicked. Matrix element k at the intersection of row i and column j shows how many distinct triplets that are no longer than i residues occur in at least j sequences. One can set the span and commonality index parameters by clicking the left mouse button (selection) and then the right mouse button (confirmation) over the corresponding matrix element. The distribution of this selected set of k common patterns along the sequences will be displayed on the main canvas.

One of the eight differently colored chips of the control panel has a shaded background; this is the active color that is used in the following bracketing, motif displaying and zooming operations. The active color is selected by a mouse click.

When the pointer moves into the main canvas, the arrow changes to a crosshair. When the crosshair is placed on one of the sequence-representing horizontal lines and the left (right) mouse button is clicked, a left (right) bracket is deposited on the line. Common triplets that occur between the brackets will be assigned the active color. Click on the redisplay button, and the histograms get repainted. Some of the previously black vertical bars become partly or completely colored, depending on how many common triplets from the selected segment match the sequence at this position.

After changing the active color, a second (third, etc.) peak can be bracketed, and additional common patterns can be assigned the new color. Clicks on the redisplay button update the coloring of the histograms. Press the decolor button to delete all color assignments and to restart the coloring process.

Frequently, corresponding sequence segments of homologous proteins will line up on the screen only if we offset the sequences and introduce appropriate gaps into them. Activating an item from the align menu will instruct VISA to do this.
We can choose either simple alignment on a single color, or full alignment on all colors. In single color alignments all sequences are aligned to the sequence that has the bracket with the selected color (the anchor sequence). The offset for a sequence is calculated so that the number of colored triplets matching the anchor sequence is maximized. When the full alignment item is selected, the dominant peaks (peaks with the most matches to the corresponding peak in the anchor sequence) are first determined for all the colors and for all the sequences. Next the dominant order of peaks (the order with the highest number of common triplets) is determined, then a longest path algorithm chooses which peaks can be included in the alignment. Appropriate gaps are then inserted into the sequences, and the resulting alignment is displayed.

Horizontal rescaling might be necessary when changing to a display of aligned sequences, because offsets and gaps have to be accommodated. This rescaling is done automatically.

Conserved sequence blocks at the center of single color alignments can be displayed with the zoom button. Sequence data will be displayed in one-letter amino acid code on the right side of the zoom window. Common triplets matching the colored triplets of the anchor sequence will be painted in the active color. A frame indicates the positions of the brackets in the anchor sequence. Positions in the other sequences that align with the left-bracket residue of the anchor sequence are shown on the left, together with sequence identifiers.

Pressing the motif button performs an implicit alignment on the active color, then brings up a new window. This window contains a list of all common triplets that match any sequence in the conserved block of the alignment. The triplets are offset according to their positions in this block. Each triplet is preceded by a number that indicates how many sequences the triplet will match with this offset.
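The single color alignment described earlier in this section chooses, for each sequence, the offset that maximizes the number of colored triplets coinciding with the anchor sequence. A minimal Python sketch of one way to compute such an offset is given below. It assumes that the colored triplets are available as a mapping from pattern to start positions; this data layout, the function name and the tie-breaking rule are illustrative assumptions, not details of VISA itself.

```python
from collections import Counter


def best_offset(anchor_hits, seq_hits):
    """Return (offset, matches) for aligning one sequence to the anchor.

    anchor_hits and seq_hits map each colored pattern to the list of its
    start positions in the anchor sequence and in the other sequence.
    Shifting the other sequence right by `offset` positions makes
    `matches` colored triplet occurrences coincide with the anchor.
    """
    votes = Counter()
    for pattern, anchor_starts in anchor_hits.items():
        for a in anchor_starts:
            for s in seq_hits.get(pattern, []):
                votes[a - s] += 1          # one vote per coinciding pair
    if not votes:
        return 0, 0                        # no shared colored triplets
    # Most votes wins; ties go to the smallest shift (an arbitrary choice).
    offset, matches = min(votes.items(),
                          key=lambda kv: (-kv[1], abs(kv[0]), kv[0]))
    return offset, matches
```

A sequence whose best match count is very low compared with the others is the kind of case that the ignore threshold item of the options panel is meant to exclude from the alignment.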
Pressing the print button will dump information about positions and sizes of active windows into auxiliary output files. Stand-alone shell scripts need these data for creating hard copies from the screen.

A secondary control panel opens up when the options button is pressed. Two items here modify assignments of colors to triplets, in two different ways. When overpaint is checked, a new assignment will overwrite earlier ones; otherwise attempts to assign a color to a triplet that already has an assigned color will be rejected. With the expand paint box checked, earlier assignments remain in effect, and (depending on the setting of overpaint) repeated bracketing assigns the active color to additional triplets. When the box is unchecked, repeated bracketing with the same color will erase earlier assignments of that color, and only the latest bracket will determine which triplets receive the active color.

When the suppress black box is checked, the paint routine is instructed not to show the uncolored, black parts of the histogram. When the flip fg/bg box is checked, the background of the main canvas turns black, making light shades more visible. Two items in the options panel, horizontal scale and vertical scale, override the automatic histogram scaling. Five different sets of colors can be used in the analysis; the active palette is set by the palette selection item.

The ignore threshold item has a role in sequence alignments. When a sequence has very little resemblance to the other sequences (its score is under t percent of the average resemblance of sequences to the whole block, where t is the current value of the threshold parameter), it is omitted from the alignment.

Sizes of sequences and distances between peaks can be estimated with the ruler subwindow. Scales in this subwindow are adjusted automatically.

INPUT

Sequence data should be presented in a multisequence file in the GCG dataset format.
Every sequence in the file is introduced by four consecutive chevron marks (">>>>") followed by an identifier string. The next line contains the description of the sequence; additional lines (up to 511 characters each) contain the sequence data.

Common pattern data can be created (and saved) from within VISA, or read from a previously prepared data file. To create this data file off-line, use gcgpat1, the same program that would be invoked from VISA.

OUTPUT

The primary output of VISA is on your screen. The top panel of the main window contains control tools (buttons, menus). Under this control panel is the main canvas. Horizontal lines represent the sequences that are being analyzed. Lengths of lines are proportional to sequence lengths and can be measured by a software ruler. A zoom window displays aligned sequence blocks; a motif window shows which common triplets occur in aligned sequence blocks. An annotation window is used to display some information about the analyzed sequences.

There are some provisions for making hard copies from the screen. VISA writes data about window sizes and positions into auxiliary output files (visa.corners, zoom.corners, motif.corners). Shell scripts (visadump.com, zoomdump.com and motifdump.com) use these data and several shareware utilities to dump parts of the screen and convert the screendumps into Portable Pixel Map or PostScript files. The names of the target output files are specified as command line arguments for the appropriate script.

EXAMPLE

Start VISA by typing "visa" at your prompt. Press the PROCEED button in the welcome window.

Click on the two LOAD checkboxes, then on the DONE button in the "read sequence & pattern file" subwindow, and load the demonstration set of xylanase sequences and their common patterns.

Move the cursor over the number 503 (row of span 11, column of index 8) in the span/index window, and click left.
The number 503 gets framed; this is the number of triplets that are not longer than 11 residues and occur in at least 8 of the xylanase sequences. Click the right mouse button to confirm the selection. Histograms appear on the main canvas, showing the distribution of common triplets along the sequences. Scroll the main canvas in both directions to see all the sequences.

Move the crosshair to the line of GUX_CELFI, to the left edge of the main peak, and click left. A small red left bracket should appear on the axis. Move the crosshair to the right edge of this peak, about 1 cm to the right of the left bracket, and click the right mouse button. A right bracket appears, and triplets in the enclosed interval are assigned the red color.

Click on the REDISPLAY button. Wherever the selected triplets occur in the sequences, they get colored red. You should see red peaks on the histograms.

Click on MOTIF. A new subwindow shows the selected common triplets. This motif subwindow can be removed using the window controlling pin (upper left corner).

Click on ZOOM to see how the sequences align on these red peaks. Use the horizontal scrollbar to center the aligned block in the text window. The limits of the aligned block are determined by the color assigning brackets of the histogram. This zoom subwindow can be removed by the standard XView method (select QUIT from the frame menu).

Click on the green square of the control panel, and select a second, uncolored peak by moving the crosshair and clicking the left, then the right mouse button. Repeat this step: select the blue chip and bracket a third peak.

Click on REDISPLAY to show how the selected triplets cluster in the sequences. Use MOTIF and ZOOM to display the triplets and the alignments corresponding to the active color (the one selected in the control panel).

Push the right button over the ALIGN menu and select the red item of the pop-up menu.
The colored histograms will be shifted so that their red peaks come into alignment. The horizontal scales are changed automatically.

Select the ALL item of the ALIGN menu. Colored histograms will be shifted and gaps will be inserted into them so that their peaks come into alignment, if possible. Selection of NONE will remove all gaps and leading offsets.

A click on DECOLOR deletes all color selections. Press SPAN/INDEX and select a different set of triplets.

A click on RULER brings up a small subwindow with a scale. Drag this over your sequence to measure sizes and distances.

A click on OPTIONS activates a control panel. A checked FLIP box inverts the screen fg/bg colors. A checked SUPPRESS BLACK box will instruct the display routine not to display uncolored triplets. Experiment with the HORIZONTAL SCALE, VERTICAL SCALE and PALETTE SELECTION items.

A click on PRINT writes the coordinates of the main, zoom and motif windows into auxiliary files. To create a PostScript file (named e.g. xylanase.ps) with the image of the main window of VISA, run the command "visadump.com xylanase.ps". This command has to be issued from a different window. Do not move, resize or hide your VISA windows between your last PRINT and the completion of the visadump.com script!

Press QUIT to end the session and to leave VISA.

REMARKS

Bar heights of the histograms are calculated by counting the number of common triplets that match the position under consideration. One matching triplet contributes to three bars, one for each of its three specific residues.

Single color alignment and full alignment may result in different conserved blocks if the order of colored peaks is not the same in all of the sequences.

Direct hard copy generation is intentionally not built into the program. Different users may have different hardware for this purpose, which would require different file formats. Also, dumps of color images may produce huge output files.
Overly easy access to screendumps may result in quickly filling disks.

ADDITIONAL TOOLS

gcgpat1: creates the common pattern data file.

visadump.com (zoomdump.com, motifdump.com): makes a screendump of the primary (zoom, motif) canvas and converts this dump into a printable PostScript file.

DISTRIBUTION by anonymous FTP

Type "ftp vent.neb.com".
Log in as "anonymous".
Use your e-mail address at the password prompt.
Type "cd /pub/software/visa" to move to the appropriate directory.
To get the files type "mget *.*".
Type "quit" to leave ftp.

If you have no Internet connection, send a request for the program to the address shown below.

INSTALLATION

The program runs on Sun workstations that use XView software. Detailed instructions are in the install.man file of the distribution kit.

REFERENCE

J. Posfai, Z. Szaraz and R.J. Roberts, "VISA: Visual Sequence Analysis for the comparison of multiple amino acid sequences", Comp. Appl. Biosci. 10 (1994), pp. 537-544.

CONTACTS

With your comments, questions or suggestions please contact:

Richard J. Roberts (or Janos Posfai)
New England Biolabs
32 Tozer Rd., Beverly, MA 01915
phone: 508-927-5054
e-mail: roberts@neb.com (posfai@neb.com)