TUTORIAL: DOT-MATRIX SIMILARITY COMPARISONS


 Fristensky, B. (1986) Improving the efficiency of dot-matrix similarity searches through use of an oligomer table. Nucleic Acids Research 14:597-610

DXHOM documentation: $doc/fsap/hom.asc


1. Copy sample files to $home/tutorial/dotmatrix

{goad:/home/plants/frist}cd
{goad:/home/plants/frist}cd tutorial
{goad:/home/plants/frist/tutorial}mkdir dotmatrix
{goad:/home/plants/frist/tutorial}cp $birch/tutorials/GDE/dotmatrix/*.gen dotmatrix
{goad:/home/plants/frist/tutorial}cd dotmatrix
{goad:/home/plants/frist/tutorial/dotmatrix}ls -l
total 58
-rw------- 1 frist drr 5404 Nov 6 17:22 ARBLKSP.gen
-rw------- 1 frist drr 4494 Nov 6 17:21 GMCAB2.gen
-rw------- 1 frist drr 4238 Nov 6 17:21 GMCAB3.gen
-rw------- 1 frist drr 4338 Nov 6 17:29 KPLACBG.gen
-rw------- 1 frist drr 3674 Nov 6 17:21 PEACAB15.gen
-rw------- 1 frist drr 3757 Nov 6 17:22 WHTCAB.gen

Read in the GenBank files using File --> Open.

2. A simple comparison between two sequences

Hint: Similarity comparisons require you to select two or more sequences at a time.
  • Select sequence(s) - either:
    • drag across several adjacent sequences
    • hold down the 'SHIFT' key and click on several sequences
  • Choose a program from one of the menus

Select GMCAB2 and GMCAB3 by dragging the cursor across the two names, and choose Similarity --> DXHOM. The DXHOM menu will appear:

Output from GMCAB2 vs. GMCAB3 using DXHOM defaults.
 

3. How search parameters affect the output

a. Compression: Zooming in and out - Each row and column in the matrix represents one or more nucleotide/amino acid positions. By default, a compression factor of 10 is used, with 70 characters per line, meaning that 700 nucleotides can be represented per line or column. In the example above, the X-axis sequence GMCAB2 is 1354 bp long, so DXHOM must break the output up into two pages, one in which GMCAB2 positions 1..700 are compared with all of GMCAB3, and another in which 701..1354 are compared with GMCAB3. The entire comparison can be fit into a single page by changing the compression factor to 20. [Output with "compression factor" (COMPRESS) = 20].
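The paging arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation described in the text, not DXHOM code; the function names pages_needed and page_ranges are hypothetical.

```python
import math

def pages_needed(seq_len, compress=10, linewidth=70):
    """Number of pages needed to show seq_len positions on the X-axis."""
    per_page = compress * linewidth  # positions represented on one page
    return math.ceil(seq_len / per_page)

def page_ranges(seq_len, compress=10, linewidth=70):
    """1-based (start, end) X-axis coordinates covered by each page."""
    per_page = compress * linewidth
    return [(start, min(start + per_page - 1, seq_len))
            for start in range(1, seq_len + 1, per_page)]

print(pages_needed(1354))               # 2
print(page_ranges(1354))                # [(1, 700), (701, 1354)]
print(pages_needed(1354, compress=20))  # 1
```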

b. The signal to noise ratio is controlled by local search window size and similarity cut off value.

The parameter "min. % similarity printed (MINPER)" sets the minimum score for a character to be printed in the matrix. The lower the percent match allowed, the higher the sensitivity, but the greater the background noise due to random chance. [Output with min. % similarity printed=50].

The window size is expressed as the distance from the center of the search window. That is, the search window contains d bases on either side of the center of the k-tuple match. So if the distance is 10, a window of 21 bases (i.e. 10 on either side of the central nucleotide) is compared. The wider the search window, the lower the probability of a match by random chance. Thus, lower MINPER values require larger search window sizes. The result from above can be "cleaned up" by doubling the window size. [Output with Dist. from center of window = 20, i.e. window size = 41].
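To make the relationship between the distance d and the window concrete, here is a minimal Python sketch of scoring one window as a percent match. It is a simplified model for illustration only and is not taken from DXHOM: the function names are hypothetical, positions are 0-based, and the handling of windows that run off the end of a sequence (they are simply truncated) is an assumption.

```python
def window_size(dist):
    """Total window length for a given distance from the center."""
    return 2 * dist + 1

def window_percent_match(x, y, i, j, dist):
    """Percent identity in a window centered on x[i] and y[j] (0-based)."""
    matches = 0
    compared = 0
    for k in range(-dist, dist + 1):
        a, b = i + k, j + k
        # Assumption: positions falling outside either sequence are skipped.
        if 0 <= a < len(x) and 0 <= b < len(y):
            compared += 1
            if x[a] == y[b]:
                matches += 1
    return 100.0 * matches / compared if compared else 0.0

print(window_size(10))  # 21
print(window_size(20))  # 41
print(window_percent_match("ACGTACGTAC", "ACGTTCGTAC", 4, 4, 2))  # 80.0
```

In this model, a point would be printed only when the window score is at least MINPER, which is why a low MINPER with a short window lets through many chance matches.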

Some reduction in background noise can also be gained by using a k-tuple of 4. This will also speed up the search by a factor of 4, but may be less sensitive.
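One way to see where a factor of 4 could come from: if all four bases are assumed to be equally frequent (an idealization), the chance that two random k-tuples are identical drops by a factor of 4 for each base added to the k-tuple. The sketch below only illustrates that arithmetic; the function name is hypothetical, and the link to DXHOM's actual run time is an assumption.

```python
def chance_match_probability(k, alphabet_size=4):
    """Probability that two random k-tuples are identical,
    assuming all letters are equally likely (an idealization)."""
    return (1.0 / alphabet_size) ** k

p3 = chance_match_probability(3)
p4 = chance_match_probability(4)
print(p3)       # 0.015625
print(p4)       # 0.00390625
print(p3 / p4)  # 4.0
```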

Finally, the signal to noise ratio can also be improved by simply doing the plot at a lower compression. One way is to use a wider output line (scroll left to right to see the entire matrix).

Dist. from center of window: 10
MINPER = 50
COMPRESS = 10
width of output line: (LINEWIDTH) = 130

4. It is often necessary to compare both strands of one sequence with one strand of the other sequence.

To illustrate the importance of comparing BOTH strands, the Bluescript vector, containing a beta-lactamase gene for ampicillin resistance, will be compared with the beta-lactamase gene from Klebsiella pneumoniae. Since Bluescript is 2958 bp long, we need to use COMPRESS=30 and LINEWIDTH=100 to make the entire matrix fit onto one page.
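The choice of COMPRESS for a given LINEWIDTH can be worked out as below. This is a hedged sketch of the arithmetic only; min_compress is a hypothetical helper, not a DXHOM option.

```python
import math

def min_compress(seq_len, linewidth):
    """Smallest compression factor that fits seq_len positions on one page."""
    return math.ceil(seq_len / linewidth)

print(min_compress(2958, 100))  # 30 (Bluescript, LINEWIDTH=100)
print(min_compress(1354, 70))   # 20 (GMCAB2, default line width)
```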
 
NOTE: If you plan to compare both strands of a reference sequence with one strand of another sequence, in each plot the reference sequence, or its inverse complement, MUST be placed on the X-axis.

DXHOM is designed to preserve the coordinate system of the reference sequence, regardless of which strand is printed. That is, if you do the first search with the original strand of Bluescript as the reference sequence, the coordinates of Bluescript would be written left to right on the X-axis, 1 to 2958. It is INCORRECT to simply generate the inverse complement of Bluescript and repeat the search, numbering 1..2958. The numbering at the top should be 2958 downto 1. In that way, a given coordinate always refers to the same part of the sequence, regardless of the strand.
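The coordinate rule can be expressed as a one-line mapping. The sketch below is an illustration of the numbering described above, with a hypothetical function name; it is not DXHOM code.

```python
def original_coordinate(pos_on_opposite, seq_len):
    """Map a 1-based position on the inverse complement back to the
    coordinate of the same base pair on the original strand."""
    return seq_len - pos_on_opposite + 1

# Bluescript is 2958 bp long, so the opposite strand is numbered 2958 downto 1.
print(original_coordinate(1, 2958))     # 2958
print(original_coordinate(2958, 2958))  # 1
```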
 

Although there's no straightforward way to get GDE to tell DXHOM which sequence goes onto the X-axis, when multiple sequences are selected, they are sent to programs in the order they appear in the GDE sequence list, top to bottom. Thus, if you want ARBLKSP to be on the X-axis, it must appear above KPLACBG in the GDE window.


 
 

Compare the original strand of Bluescript (ARBLKSP) with the Klebsiella gene (KPLACBG).

Dist. from center of window: 10
min % similarity: 60
COMPRESS=30
LINEWIDTH=100
As seen in the output, there are no significant diagonals, indicating that there are no
significant similarities between these two strands. Now, let's generate the opposite
strand of Bluescript for comparison.

a. Make a copy of ARBLKSP and rename it ARBLKSP-opp.

Select ARBLKSP and choose Edit --> Copy and Edit --> Paste to create a duplicate of this sequence. Next, select one of the copies and choose File --> Get Info. Add "-opp" to the name to indicate that this copy of the sequence will be changed to the opposite strand. This is an important step, because otherwise you have no way to keep track of which strand is which!  I'm not kidding here. This is important. You laugh, but I guarantee you'll mess things up if you don't follow organizational tips like this.

b. Create the inverse complement - 2 steps

First, create the complementary strand by selecting ARBLKSP-opp. Choose DNA/RNA --> Complement.

You can verify for yourself that the complementary strand has been created. If you think about it, you'll realize that this strand now reads 3' --> 5' going left to right. Therefore, it must also be reversed. Choose Edit --> Reverse to create the inverse complement.
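The same two steps, complement and then reverse, can be checked outside GDE with a short Python sketch. The function names are hypothetical, and only the unambiguous bases A, C, G and T are handled; any other character is passed through unchanged.

```python
# Translation table for the four unambiguous DNA bases, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(seq):
    """Step 1: complementary strand, which reads 3' --> 5' left to right."""
    return seq.translate(COMPLEMENT)

def inverse_complement(seq):
    """Step 2: reverse the complement so it reads 5' --> 3' again."""
    return complement(seq)[::-1]

print(complement("ATGC"))          # TACG
print(inverse_complement("ATGC"))  # GCAT
```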

Now, ARBLKSP-opp contains the inverse complement of ARBLKSP. To compare it with KPLACBG, select both sequences and set the parameters as shown:

There are 3 parameters that must be set to tell DXHOM that we are using the opposite strand. First, click on "opposite". Additionally, STARTX must be set to the length of the X-axis sequence, which can be done by just pulling the STARTX slider all the way to the right. If DXHOM gets a number larger than the length of the sequence, it uses the sequence length. At the same time, FINISHX must be set to 1. Just pull the FINISHX slider to the left.

As illustrated in the map of Bluescript shown above, the DXHOM output verifies that the opposite strand of Bluescript has a beta-lactamase gene, going roughly from 2680 downto 1960. [DXHOM output using opposite strand of ARBLKSP].