PLNT4610/PLNT7690 Bioinformatics
Lecture 4, part 1 of 2

September 28, 2017

SIMILARITY AND ALIGNMENTS



REFERENCES

Fristensky, B. (1986) Improving the efficiency of dot-matrix similarity searches through use of an oligomer table. Nucleic Acids Research 14:597-610

Needleman SB and Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol 48:443-453.

Schuler GD (1998) Sequence alignment and database searching. In: Baxevanis AD and Ouellette BFF, Eds. Bioinformatics: A practical guide to the analysis of genes and proteins. John Wiley, Toronto.

Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Co., Toronto. Ch. 3 "Sequence Comparison and Database Search".

Pearson WR (1998) Flexible sequence similarity searching with the FASTA3 program package. http://people.virginia.edu/~wrp/papers/mmol98f.pdf


A. Similarity, homology, and analogy

B. Graphic similarity comparisons

1. Similarity between two sequences can be detected as a diagonal on an identity matrix.

2. Similarity comparisons can be sped up using lookup tables containing positions of short words (oligomers) from one of the sequences.

3. Similarity searches can also be used to detect direct repeats and inverted repeats.

C. Global and local optimal alignments

1. Global sequence alignment by dynamic programming

2. Scoring matrices


A. Similarity, homology and analogy

1. Terms

Identical - When a corresponding character is shared between two species or populations, that character is said to be identical.

Similar -  The degree to which two species or populations share identities.

Homologous - When characters are similar due to common ancestry, they are homologous.

Analogous - When characters are similar due to convergent evolution, they are analogous.

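Orthologous - When characters are homologous with conserved function, they are orthologous. (More strictly, orthologs are homologs that diverged through speciation.)

Paralogous - When characters are homologous with divergent function, they are paralogous. (More strictly, paralogs are homologs that arose through gene duplication.)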

Homology is therefore NOT synonymous with similarity. Homology is a judgement; similarity is a measurement.

2. Why do similarity searches?

B. Graphic similarity comparisons

1. Similarity between two sequences can be detected as a diagonal on an identity matrix.

Humans are remarkable in their ability to recognize patterns by 'just looking at it'. So far, programmers have had only limited success in devising algorithms (the computer equivalent of laboratory protocols) for pattern recognition. At the same time, humans are poor at highly repetitive tasks with large quantities of data.

Graphic similarity comparisons use the power of the computer to present relationships between sequences in a graphic form that enables the human researcher to discern patterns in the data. If we wish to determine whether two sequences are similar, we must compare all parts of one sequence with all parts of the other. This could be accomplished by sliding one sequence along the other and noting the number of identities at each alignment. The alignment with the greatest number of identities would be the optimal alignment.
 
 
GGCTTGACCGG-->
     |    |
     GGATTGACCCG

     GGCTTGACCGG-->
     || |||||| |
     GGATTGACCCG

     GGCTTGACCGG-->
     | |  |
GGATTGACCCG
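The sliding comparison above can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name identities_at_offsets and the convention that the offset is the position in t aligned against the first base of s are assumptions made here.

```python
def identities_at_offsets(s, t):
    """Slide s along t and count identical bases at every offset.

    offset is the 0-based position in t that s[0] is aligned against;
    a negative offset means s overhangs the left end of t.
    """
    counts = {}
    for offset in range(-(len(s) - 1), len(t)):
        matches = 0
        for i, base in enumerate(s):
            j = i + offset
            if 0 <= j < len(t) and base == t[j]:
                matches += 1
        counts[offset] = matches
    return counts


s = "GGCTTGACCGG"
t = "GGATTGACCCG"
counts = identities_at_offsets(s, t)
best = max(counts, key=counts.get)   # offset with the most identities
print(best, counts[best])
```

For the two sequences in the figure, the three alignments shown correspond to offsets -5, 0 and 5, with 2, 9 and 3 identities respectively.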

The same thing could be accomplished by placing one sequence on the X axis and the other on the Y axis of a matrix, and printing a character at each X,Y coordinate at which both sequences have identical bases.

   G G C T T G A C C G G
G  A A       A       A A
G  A A       A       A A
A              A
T        A A
T        A A
G  A A       A       A A
A              A
C      A         A A
C      A         A A
C      A         A A
G  A A       A       A A

This is the simplest form of a "dot-matrix" comparison. Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix. This approach is exhaustive, because the matrix encompasses all possible alignments. However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity. That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A,G,C and T.
 

ALGORITHM
Dot-matrix comparison, l=1
input: Sequences: s of length m, t of length n
output: matrix a[1..m,1..n]
for i = 1..m   // for each nucleotide in s
     for j = 1..n  // for each nucleotide in t
          if s[i] = t[j] then
               a[i,j] = 'A'
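A minimal Python sketch of the pseudocode above, assuming the matrix is returned as a list of text rows (one row per base of s, one column per base of t). The function name dot_matrix and the mark parameter are illustrative choices, not part of the lecture.

```python
def dot_matrix(s, t, mark="A"):
    """Return one text row per base of s; a mark is placed wherever s[i] == t[j]."""
    return ["".join(mark if a == b else " " for b in t) for a in s]


y_seq = "GGATTGACCCG"   # sequence on the Y axis (rows)
x_seq = "GGCTTGACCGG"   # sequence on the X axis (columns)
rows = dot_matrix(y_seq, x_seq)

print("  " + x_seq)
for label, row in zip(y_seq, rows):
    print(label, row)
```

The nested loop is written here as a nested comprehension, so the work is still m x n single-base comparisons.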

This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position. For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4. Therefore, the number of background matches will be lower:

   G G C T T G A C C G G
G  A                 A
G            A
A
T        A
T          A
G            A
A              A
C                A
C                A
C                  A
G

The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l. For each position in sequence s, compare  a window of l nucleotides centered at that position with each window of l nucleotides in sequence t. Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
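The generalized comparison might be sketched as follows. Two simplifications are assumed here and are not from the lecture: windows are identified by their start position rather than their center (which only shifts the coordinates), and a min_percent threshold is included so that imperfect windows can be reported. The function name is hypothetical.

```python
def windowed_dot_matrix(s, t, l, min_percent=100):
    """Compare every window of l bases in s with every window of l bases in t.

    Returns {(i, j): percent} for window pairs whose percent identity is at
    least min_percent; i and j are 0-based window start positions.
    """
    hits = {}
    for i in range(len(s) - l + 1):
        for j in range(len(t) - l + 1):
            identical = sum(1 for a, b in zip(s[i:i + l], t[j:j + l]) if a == b)
            percent = 100 * identical // l
            if percent >= min_percent:
                hits[(i, j)] = percent
    return hits


hits = windowed_dot_matrix("GGATTGACCCG", "GGCTTGACCGG", l=2)
print(len(hits))
```

Each window pair costs l comparisons, so the total work grows as l x m x n.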
 

For sequences of realistic length, it's not practical to write both sequences on the axes, so instead numbers are used to represent position in each sequence. Also, for longer sequences a window size of l=2 is too small, because the number of background dinucleotide matches increases as the sequences increase in length.

Example: Comparison of two soybean chlorophyll a/b binding protein genes (X12980, X12981)

In the example, a compression of 25 is used, meaning that each row and column in the matrix represents 25 nucleotides, so that each cell represents 25^2 = 625 comparisons of l = 20 nucleotides. The diagonal encompasses most of the matrix, indicating that these two genes share strong similarity over most of their length. In this example, the Minimum Percent Similarity is set to 60, meaning that for a character to be printed in the matrix, a given 20 nucleotide window must be at least 60% identical between the two sequences (i.e. 12 out of 20). To indicate the quality of each match, a character code is used:
 
Char.  %        Char.  %        Char.  %        Char.  %
A      100      N      74-75    a      48-49    n      22-23
B      98-99    O      72-73    b      46-47    o      20-21
C      96-97    P      70-71    c      44-45    p      18-19
D      94-95    Q      68-69    d      42-43    q      16-17
E      92-93    R      66-67    e      40-41    r      14-15
F      90-91    S      64-65    f      38-39    s      12-13
G      88-89    T      62-63    g      36-37    t      10-11
H      86-87    U      60-61    h      34-35    u      8-9
I      84-85    V      58-59    i      32-33    v      6-7
J      82-83    W      56-57    j      30-31    w      5
K      80-81    X      54-55    k      28-29
L      78-79    Y      52-53    l      26-27
M      76-77    Z      50-51    m      24-25

Pustell J and Kafatos F (1982) A high speed, high capacity similarity matrix: zooming through SV40 and polyoma Nucl. Acids Res. 10: 4765-4782.
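The character code follows a regular pattern: A is 100%, and each subsequent letter covers the next two percentage points down. A small Python sketch of that mapping is given below. The name char_code, the restriction to integer percentages, and the choice to return a blank for values below the last entry in the table (w = 5) are assumptions made for illustration.

```python
import string

# Upper-case letters cover 100% down to 50%, lower-case letters continue below that.
CODES = string.ascii_uppercase + string.ascii_lowercase


def char_code(percent):
    """Map an integer percent identity (0..100) to its character code."""
    if percent < 5:          # below the last table entry: assumed blank
        return " "
    index = (100 - percent + 1) // 2
    return CODES[index]


print(char_code(100), char_code(75), char_code(60))
```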

Because users already know the order of letters in the alphabet, the character codes provide an intuitive picture of the quality of the match in all parts of the sequences. In the example, the presence of A's in the diagonal from about 825 to 875 in both sequences shows that this region is highly conserved, whereas similarity drops off outside of this region. The first 250 bases of both sequences show little similarity. The GenBank entries indicate that the protein coding sequence begins at 251 in each sequence. Thus, the 5' non-coding regions of these genes must be poorly conserved.

2. Similarity comparisons can be sped up using lookup tables containing positions of short words (oligomers) from one of the sequences.

The efficiency of the dot-matrix similarity search algorithm can be stated as O(lmn). That means that the time required to compare two sequences is proportional to the product of the lengths of the two sequences and the length of the search window. For example, a comparison of two 5000 nt sequences using a window size of 20 would require 5000 x 5000 x 20 = 5 x 10^8 nucleotide comparisons (1 x 10^9 if we consider comparing both strands).

A quick inspection of most matrix similarity outputs shows that the vast majority of the area of the matrix contains either blank space, which indicates that no local similarities were found, or very small similarities which have probably occurred at random. Thus, the majority of the search time is spent investigating regions of non-similarity.

Wilbur and Lipman (1983) recognized that even imperfect similarities are likely to share small regions of perfect similarity (e.g. 4 bases). Given the probability p of any two characters matching, the probability that two k-mers chosen at random match is simply p^k. The expected distance between two occurrences of k matches is therefore 1/p^k. For example, if the probability of a single nucleotide match is 1/4, then trinucleotides should match on average once every 64 bases. These matches are due to background similarity. Since regions which share significant similarity must by definition have a frequency of matches which is higher than background, similar regions will have more frequent k-mer matches, and are consequently more likely to be found. If, when searching the X-axis sequence, we knew in advance where matches of k nucleotides occurred, we might only look at those places to find out if the match extended to a length of l nucleotides. It can be shown that a lookup table of k-mers in any sequence can be constructed in O(n) steps. An example of a lookup table for trinucleotides is shown below:
 
Table 1. Example of a Lookup Table

Locations of the 64 possible trinucleotides in sequence X. The numbers shown indicate the position of the central nucleotide in a triplet, as they might occur in some hypothetical DNA sequence.

Trinucleotide   Location(s) in seq. X
AAA             13, 71, 179, 204, ...
AAC             35, 72, 123, 199, ...
AAG             7, 50, 87, 104, 249, ...
...             ...
...             ...
TTG             2, 40, 95, 172, ...
TTT             77, 94, 169, 195, ...
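A table of this kind can be built in a single pass over the sequence. The Python sketch below is one way to do it, under the assumption that positions are stored as 0-based start positions of each k-mer rather than the central-nucleotide positions used in Table 1. The name make_table is illustrative.

```python
from collections import defaultdict


def make_table(seq, k=3):
    """Map each k-mer found in seq to the list of 0-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


table = make_table("GGCTTGACCGG", k=3)
for kmer in sorted(table):
    print(kmer, table[kmer])
```

Only k-mers that actually occur get an entry, so a lookup for an absent k-mer can be treated as an empty list.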

Using the table as a guide, each occurrence of a given trinucleotide in sequence X is located, and the region centered on that position, w nucleotides to the left and the right, is compared with the corresponding region in sequence Y. If the match is good enough, a symbol is printed at the point in the matrix which corresponds to the centers of the two regions. The process is repeated for each trinucleotide in sequence Y. Since each trinucleotide occurs on average only once every 64 bases, the algorithm makes only N/64 searches for each triplet in Y, rather than N. Generally, the efficiency of this algorithm is O(lmn/S^k), where S is the alphabet size. For nucleic acids, S=4 (A,G,C,T), so a trinucleotide search would provide a 64-fold increase in speed, while a tetranucleotide search provides a 256-fold speedup. For amino acid sequences, S=20, so a search set to k=1 provides a 20-fold increase in speed, while k=2 provides a 400-fold increase.

In essence, the lookup table speeds up the search by sampling the X-axis sequence where perfect oligomer matches occur, rather than exhaustively comparing every possible window of l nucleotides between two sequences. The algorithm can be summarized thus:
 

ALGORITHM
Dot-matrix comparison, k=3
input: Sequences: s of length m, t of length n
const: MINPER  // minimum percentage match
output: matrix a[1..m,1..n]

Maketable(TAB(x,y,z),t) // make lookup table using t

for i = 1..m   // for each nucleotide in s
     set x,y,z to central triplet in window centered at s[i]
     for each position j listed in TAB(x,y,z)
          pct = percent identity of the l-base windows centered at s[i] and t[j]
          if pct >= MINPER then
               a[i,j] = CharCode(pct)
                    // CharCode returns character to print
                    // for a given percent identity
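The lookup-table search might look like this in Python. It is a sketch under stated assumptions rather than the published implementation: the window extends w bases to either side of the center of the matching k-mer, windows are truncated at the sequence ends, and the result is returned as a dictionary of hits instead of a printed matrix. All names are hypothetical.

```python
from collections import defaultdict


def lookup_dot_matrix(s, t, k=3, w=2, min_percent=60):
    """Dot-matrix search seeded by exact k-mer matches between s and t.

    Returns {(ci, cj): percent}, where ci and cj are the 0-based centers of
    the compared regions in s and t.
    """
    # Build the lookup table from t: k-mer -> start positions.
    table = defaultdict(list)
    for j in range(len(t) - k + 1):
        table[t[j:j + k]].append(j)

    hits = {}
    half = k // 2
    for i in range(len(s) - k + 1):
        # Only positions in t sharing this exact k-mer are examined.
        for j in table.get(s[i:i + k], ()):
            ci, cj = i + half, j + half
            lo = -min(w, ci, cj)                                  # clip at left ends
            hi = min(w, len(s) - 1 - ci, len(t) - 1 - cj)         # clip at right ends
            identical = sum(1 for d in range(lo, hi + 1) if s[ci + d] == t[cj + d])
            percent = 100 * identical // (hi - lo + 1)
            if percent >= min_percent:
                hits[(ci, cj)] = percent
    return hits


hits = lookup_dot_matrix("GGATTGACCCG", "GGCTTGACCGG", k=3, w=2, min_percent=60)
for (ci, cj), percent in sorted(hits.items()):
    print(ci, cj, percent)
```

The inner loop runs only over table entries for the current k-mer, which is where the speedup over the exhaustive search comes from.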

 

To ensure a thorough search, we must choose a combination of k value and window size such that the window l bases wide which is searched at one k-mer match will overlap the adjacent window. The average distances between k-matches for different values of p and k are given in Table 2.
 

 
Table 2. Average distance between k-matches (1/p^k)

Prob. of a match (p)   k=2    k=3    k=4    k=5
0.050                  400    8000
0.075                  178    2370
0.100                  100    1000
0.150                  44     296
0.200                  25     125
0.250                  16     64     256    1024
0.300                  11     37     123    412
0.350                  8      23     67     190
0.450                  5      11     24     54
0.600                  3      5      8      13
0.700                  2      3      4      6
0.900                  1      1      1      2

For example, if the probability of a match between two DNA sequences is 0.25, we expect to see a dinucleotide match once every 16 bases in a comparison. Trinucleotides will match on average once every 64 bases, and so on. These are background matches; regions that share significant similarity will show k-mer matches more often than this, and are consequently more likely to be found.

Table 2 illustrates how the overall level of similarity between two sequences affects the expected distance between k-mer matches. Knowing the expected frequency of k-mer matches allows us to predict the level of similarity likely to be missed. If we wish to find nucleotide similarities with 30% match or better, a triplet search (k=3) will necessitate the use of a window size l >= 19, since the average distance between triplet matches is 37. The actual choice of k and l values will depend on the purpose of the search.
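The entries in Table 2 follow directly from 1/p^k, and can be checked with a short Python sketch. The figure of 19 quoted above is consistent with taking half the expected spacing and rounding up; the helper min_range below assumes that reading, and both function names are illustrative.

```python
import math


def expected_distance(p, k):
    """Average spacing between chance k-mer matches: 1 / p**k."""
    return 1.0 / p ** k


def min_range(p, k):
    """Half the expected spacing, rounded up (assumed rule for overlapping windows)."""
    return math.ceil(expected_distance(p, k) / 2)


for p in (0.25, 0.30):
    print(p, [round(expected_distance(p, k)) for k in (2, 3, 4, 5)])
print(min_range(0.30, 3))
```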

3. Similarity searches can also be used to detect direct repeats and inverted repeats.

One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself. In self comparisons, direct repeats appear as diagonals parallel to the main line of identity. For example, each member of the human AluI middle sequence family is itself made of two subrepeats. In the human AluI sequence p27 (GenBank K01153), the sequence from approximately 30..120 is imperfectly repeated at positions 140..260.
 
 
D3HOM         Version  5/13/91
X-axis: >HUMRSA27
Y-axis: >HUMRSA27
SIMILARITY RANGE:  15      MIN.PERCENT SIMILARITY:  50
SCALE FACTOR:    0.95      COMPRESSION:             10

              100       200

       I |      . ||      .     |   .         .
        A|      . ||      .     |   .         .
---------A--------|R      .     |   .         .
         |A     . | R     .     |   .         .
         | A    . |  TT   .     |   .         .
         |  C   . |   V Y .     |   .         .
         |   B  . |      YZ     |   .         .
         |    A . |     Y .     |   .         .
         |     A. |       .     |   .         .
   100 ..|......A.|..........Y..|..............
         |      .A|       .   U |   .         .
---------|--------A------------XZ   .         .
---------R      . |A      .         .         .
          R     . | A   Y .         .         .
           T    . |  A  VX.         .         .
           TV   . |   A   .         .         .
                . |    C Z.         .         .
            Y Y . | YV  C .         .         .
             Y  . |  X Z A.         .         .
   200 ......Z....|.......A....................
                . |       .A        .         .
                . |       . C       .         .
                Y |       .  B      .         .
                .U|       .   A     .         .
                . X       .    AZ   .         .
------------------Z       .    ZF   .         .
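As a rough illustration of how repeats fall out of a self comparison, the Python sketch below reports exact k-mer matches of a sequence against itself (direct repeats) and against its reverse complement (inverted repeats). It is not the D3HOM method: it finds only perfect k-mer matches, and the function names and the toy sequences are made up for the example.

```python
from collections import defaultdict


def reverse_complement(seq):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))


def kmer_hits(s, t, k):
    """Return (i, j) start positions where s[i:i+k] == t[j:j+k]."""
    table = defaultdict(list)
    for j in range(len(t) - k + 1):
        table[t[j:j + k]].append(j)
    return [(i, j) for i in range(len(s) - k + 1) for j in table.get(s[i:i + k], ())]


def direct_repeats(seq, k):
    # i < j drops the main diagonal and the mirror-image half of the matrix.
    return [(i, j) for i, j in kmer_hits(seq, seq, k) if i < j]


def inverted_repeats(seq, k):
    n = len(seq)
    rc = reverse_complement(seq)
    # A k-mer starting at j in rc covers forward-strand positions starting at n - k - j.
    pairs = [(i, n - k - j) for i, j in kmer_hits(seq, rc, k)]
    return [(i, j) for i, j in pairs if i <= j]


print(direct_repeats("ACGTTAGGACGTT", 5))
print(inverted_repeats("AAGCTCCCCGAGCTT", 6))
```

In the first toy sequence ACGTT occurs at positions 0 and 8; in the second, AAGCTC at position 0 is the reverse complement of GAGCTT at position 9.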

A more interesting example is seen in human sequence p16, which contains both an AluI family sequence, as well as eight tandem repeats, which themselves are made up of two imperfect repeats:

Human clone p16 (GenBank K01154)

Self comparison of p16 (low resolution)

Self comparison of p16 (high resolution, 66nt repeats only)
 

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada

 