PLNT4610/PLNT7690 Bioinformatics
Lecture 4, part 1 of 2

PAIRWISE SIMILARITY AND ALIGNMENTS

REFERENCES

Fristensky, B. (1986) Improving the efficiency of dot-matrix similarity searches through use of an oligomer table. Nucleic Acids Research 14:597-610

Needleman SB and Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443-453.

Schuler GD (1998) Sequence alignment and database searching. In Baxevanis AD and Ouellette BFF, Eds. Bioinformatics: A practical guide to the analysis of genes and proteins. John Wiley, Toronto.

Setubal, J. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Co., Toronto. Ch. 3 "Sequence Comparison and Database Search".

Pearson WR (1998) Flexible sequence similarity searching with the FASTA3 program package. http://people.virginia.edu/~wrp/papers/mmol98f.pdf

A. Similarity, homology, and analogy

B. Graphic similarity comparisons

1. Similarity between two sequences can be detected as a diagonal on an identity matrix.

2. Similarity comparisons can be speeded up using lookup tables containing positions of short words (oligomers) from one of the sequences.

3. Similarity searches can also be used to detect direct repeats and inverted repeats.

Tutorial: Dot-matrix similarity comparisons

C. Global and local optimal alignments

1. Global sequence alignment by dynamic programming

2. Scoring matrices

Tutorial: Pairwise Sequence Alignment

A. Similarity, homology and analogy

1. Terms

Identical - When a corresponding character is shared between two species or populations, that character is said to be identical.

Similar - The degree to which two species or populations share identities.

Homologous - When characters are similar due to common ancestry, they are homologous.

Analogous - When characters are similar due to convergent evolution, they are analogous.

Orthologous - When characters are homologous with conserved function, they are orthologous.

Paralogous - When characters are homologous with divergent function, they are paralogous.

Homology is therefore NOT synonymous with similarity. Homology is a judgement, similarity is a measurement.

2. Why do similarity searches?

identify an unknown sequence
determine which regions are conserved between proteins or nucleic acids (i.e. most biologically significant)
genome assembly
genome annotation
comparative genomics
metagenomics

B. Graphic similarity comparisons

1. Similarity between two sequences can be detected as a diagonal on an identity matrix.

Humans are remarkable in their ability to recognize patterns by 'just looking at it'. So far, programmers have had only limited success in devising algorithms (the computer equivalent of laboratory protocols) for pattern recognition. At the same time, humans are poor at highly repetitive tasks with large quantities of data.

Graphic similarity comparisons use the power of the computer to present relationships between sequences in such a graphic form that enables the human researcher to discern patterns in the data. If we wish to determine whether two sequences are similar, we must compare all parts of one sequence with all parts of the other. This could be accomplished by sliding one sequence along the other and noting the number of identities at each alignment. The alignment with the greatest number of identities would be the optimal alignment.

GGCTTGACCGG-->
     |    |
     GGATTGACCCG

     GGCTTGACCGG-->
     || |||||| |
     GGATTGACCCG

     GGCTTGACCGG-->
     | |  |
GGATTGACCCG
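
The sliding comparison drawn above can be sketched in a few lines of Python. This is only an illustration, not part of any program discussed in these notes; the function name best_offset and its return convention are assumptions made for the example. It tries every ungapped offset of one sequence against the other and reports the offset with the most identities.

```python
def best_offset(s: str, t: str) -> tuple[int, int]:
    """Slide t along s and return (offset, identities) for the best ungapped alignment.

    offset is the index in s that lines up with t[0]; it is negative when
    t hangs off the left end of s.
    """
    best = (0, -1)
    for offset in range(-(len(t) - 1), len(s)):
        matches = 0
        for j, base in enumerate(t):
            i = offset + j
            # only count positions where the two sequences overlap
            if 0 <= i < len(s) and s[i] == base:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


print(best_offset("GGCTTGACCGG", "GGATTGACCCG"))  # (0, 9)
```

For the two example sequences the best placement is the middle one in the diagram, with 9 identities out of 11.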

The same thing could be accomplished by placing both sequences on the X and Y axes of a matrix, and printing a character at each X,Y coordinate at which both sequences have identical bases.

    G G C T T G A C C G G
G   A A       A       A A
G   A A       A       A A
A               A
T         A A
T         A A
G   A A       A       A A
A               A
C       A         A A
C       A         A A
C       A         A A
G   A A       A       A A

This is the simplest form of a "dot-matrix" comparison. Where part of one sequence shares a long stretch of similarity with the other sequence, a diagonal of dots will be evident in the matrix. This approach is exhaustive, because the matrix encompasses all possible alignments. However, when single bases are compared at each position, most of the dots in the matrix will be due to background similarity. That is, for any two nucleotides compared between the two sequences, there is a 1 in 4 chance of a match, assuming equal frequencies of A,G,C and T.

ALGORITHM
Dot-matrix comparison, l=1

input: Sequences: s of length m, t of length n
output: matrix a[1..m,1..n]

for i = 1..m   // for each nucleotide in s
     for j = 1..n  // for each nucleotide in t

          if s[i] = t[j] then
             a[i,j] = 'A'
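
A minimal Python sketch of the l=1 algorithm above follows. The function name dot_matrix and the choice to return one text row per base of t are assumptions made for illustration; the rows correspond to the Y-axis sequence and the columns to the X-axis sequence, as in the matrix shown earlier.

```python
def dot_matrix(s: str, t: str, mark: str = "A") -> list[str]:
    """Return one text row per base of t; columns follow the bases of s."""
    rows = []
    for tj in t:
        # print the mark wherever the two bases are identical, else a blank
        rows.append("".join(mark if si == tj else " " for si in s))
    return rows


s, t = "GGCTTGACCGG", "GGATTGACCCG"
print("  " + s)
for base, row in zip(t, dot_matrix(s, t)):
    print(base + " " + row)
```

With these two 11-base sequences, 35 of the 121 cells are marked, most of them off the main diagonal, which illustrates how much of a single-base matrix is background.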

This background noise can be filtered out by comparing groups of l nucleotides, rather than single nucleotides, at each position. For example, if we compare dinucleotides (l = 2), the probability of two dinucleotides chosen at random from each sequence matching is 1/16, rather than 1/4. Therefore, the number of background matches will be lower:

    G G C T T G A C C G G
G   A                 A
G             A
A
T         A
T           A
G             A
A               A
C                 A
C                 A
C                   A
G

Here each A is drawn at the position where a matching dinucleotide begins in both sequences.

The dot-matrix algorithm can be generalized for sequences s and t of sizes m and n, respectively, and window size l. For each position in sequence s, compare a window of l nucleotides centered at that position with each window of l nucleotides in sequence t. Conceptually, you can think of windows of length l sliding along each axis, so that all possible windows of l nucleotides are compared between the two sequences.
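
The generalized comparison can be sketched as follows. This is a simplified illustration rather than the implementation used by any particular program: the name window_hits is hypothetical, windows are indexed by their start position rather than their center, and the threshold is given as a number of identities instead of a percentage.

```python
def window_hits(s: str, t: str, l: int, min_identities: int) -> list[tuple[int, int, int]]:
    """Compare every l-base window of s with every l-base window of t.

    Returns (i, j, identities) for 0-based window starts i in s and j in t
    whose windows share at least min_identities identical bases.
    """
    hits = []
    for i in range(len(s) - l + 1):
        for j in range(len(t) - l + 1):
            identities = sum(s[i + d] == t[j + d] for d in range(l))
            if identities >= min_identities:
                hits.append((i, j, identities))
    return hits


hits = window_hits("GGCTTGACCGG", "GGATTGACCCG", l=2, min_identities=2)
print(len(hits))  # 10
```

Requiring both bases of each dinucleotide to match leaves 10 marked cells for the example sequences. The three nested loops (over i, j and the l positions in the window) are what make this approach O(lmn).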

For sequences of realistic length, it's not practical to write both sequences on the axes, so instead numbers are used to represent position in each sequence. Also, for longer sequences a window size of l=2 is too small, because the number of chance dinucleotide matches grows with the product of the two sequence lengths.

Example: Comparison of two soybean chlorophyll a/b binding protein genes (X12980, X12981)

In the example, a compression of 25 is used, meaning that each row and column in the matrix represents 25 nucleotides, so that each cell represents 25² = 625 comparisons of l = 20 nucleotides. The diagonal encompasses most of the matrix, indicating that these two genes share strong similarity over most of their length. In this example, the Minimum Percent Similarity is set to 60, meaning that for a character to be printed in the matrix, a given 20 nucleotide window must be at least 60% identical between the two sequences (i.e. 12 out of 20). To indicate the quality of each match, a character code is used:

Char.  %        Char.  %        Char.  %        Char.  %
A      100      N      74-75    a      48-49    n      22-23
B      98-99    O      72-73    b      46-47    o      20-21
C      96-97    P      70-71    c      44-45    p      18-19
D      94-95    Q      68-69    d      42-43    q      16-17
E      92-93    R      66-67    e      40-41    r      14-15
F      90-91    S      64-65    f      38-39    s      12-13
G      88-89    T      62-63    g      36-37    t      10-11
H      86-87    U      60-61    h      34-35    u      8-9
I      84-85    V      58-59    i      32-33    v      6-7
J      82-83    W      56-57    j      30-31    w      5
K      80-81    X      54-55    k      28-29
L      78-79    Y      52-53    l      26-27
M      76-77    Z      50-51    m      24-25

Pustell J and Kafatos F (1982) A high speed, high capacity similarity matrix: zooming through SV40 and polyoma Nucl. Acids Res. 10: 4765-4782.
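
The character code in the table is regular enough to compute rather than look up: each letter after A covers two percentage points. The sketch below is one way to do that in Python. The function name char_code is hypothetical, it assumes the percent identity has already been rounded to an integer, and returning a blank below 5% is an assumption, since the table lists no code for those values.

```python
import string

# 'A'..'Z' followed by 'a'..'w': 49 codes in order of decreasing identity
CODES = string.ascii_uppercase + string.ascii_lowercase[:23]


def char_code(percent: int) -> str:
    """Map an integer percent identity (0..100) to its one-character code."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 5:
        return " "  # assumption: the table lists no code below 5%
    # 100 -> index 0 ('A'), 98-99 -> 1 ('B'), 96-97 -> 2 ('C'), ...
    return CODES[(101 - percent) // 2]


print(char_code(100), char_code(60), char_code(49))  # A U a
```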

Because users already know the order of letters in the alphabet, the character codes provide an intuitive picture of the quality of the match in all parts of the sequences. In the example, the presence of A's in the diagonal from about 825 to 875 in both sequences shows that this region is highly conserved, whereas similarity drops off outside of this region. The first 250 bases of both sequences show little similarity. The GenBank entries indicate that the protein coding sequence begins at 251 in each sequence. Thus, the 5' non-coding regions of these genes must be poorly conserved.

2. Similarity comparisons can be speeded up using lookup tables containing positions of short words (oligomers) from one of the sequences.

The efficiency of the dot-matrix similarity search algorithm can be stated as O(lmn). That means that the time required to compare two sequences is proportional to the product of the lengths of the sequences times the length of the search window. For example, a comparison of two 5000 nt sequences using a window size of 20 would require 5000 x 5000 x 20 = 5 x 10⁸ nucleotide comparisons (1 x 10⁹ if we consider comparing both strands).

A quick inspection of most matrix similarity outputs shows that the vast majority of the area of the matrix contains either blank space, which indicates that no local similarities were found, or very small similarities which have probably occurred at random. Thus, the majority of the search time is spent investigating regions of non-similarity.

Wilbur and Lipman (1983) recognized that even imperfect similarities are likely to contain short regions of perfect similarity (e.g. 4 bases). Given the probability p of any two characters matching, the probability that two k-mers chosen at random match is simply p^k. The expected distance between two occurrences of k consecutive matches is therefore 1/p^k. For example, if the probability of a single nucleotide match is 1/4, then trinucleotides should match on average once every 64 bases. These matches are due to background similarity. Since regions which share significant similarity must by definition have a frequency of matches which is higher than background, similar regions will have more frequent k-mer matches, and are consequently more likely to be found.

If, when searching the X-axis sequence, we knew in advance where matches of k nucleotides occurred, we would only need to look at those places to find out whether the match extended to a length of l nucleotides. It can be shown that a lookup table of k-mers in any sequence of length n can be constructed in O(n) steps. An example of a lookup table for trinucleotides is shown below:

Table 1. Example of a Lookup Table
Locations of the 64 possible trinucleotides in sequence X. The numbers shown indicate the position of the central nucleotide in a triplet, as they might occur in some hypothetical DNA sequence.

Trinucleotide   Location(s) in seq. X
AAA             13, 71, 179, 204, ...
AAC             35, 72, 123, 199, ...
AAG             7, 50, 87, 104, 249, ...
...             .......
...             .......
TTG             2, 40, 95, 172, ...
TTT             77, 94, 169, 195, ...
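
A lookup table like Table 1 can be built in a single pass over the sequence. The following Python sketch is an illustration only: the name make_table is hypothetical, it assumes an odd k so that a central nucleotide exists, and it stores only the k-mers that actually occur rather than all 64 possible trinucleotides.

```python
from collections import defaultdict


def make_table(seq: str, k: int = 3) -> dict[str, list[int]]:
    """Map each k-mer in seq to the 1-based positions of its central base."""
    table = defaultdict(list)
    for start in range(len(seq) - k + 1):
        center = start + k // 2 + 1  # 1-based position of the middle base
        table[seq[start:start + k]].append(center)
    return dict(table)


table = make_table("GGCTTGACCGG")
print(table["GGC"])  # [2]
```

Each position is visited once and the work per position depends only on k, so for a fixed k the table is built in time proportional to the sequence length.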

For each trinucleotide in sequence Y, the table is used to locate every occurrence of that trinucleotide in sequence X. The region centered on each such position, w nucleotides to the left and the right, is compared with the corresponding region in sequence Y. If the match is good enough, a symbol is printed at the point in the matrix which corresponds to the centers of the two regions. Since each trinucleotide occurs on average only once every 64 bases, the algorithm makes only N/64 searches for each triplet in Y, rather than N. Generally, the efficiency of this algorithm is O(lmn/S^k), where S is the alphabet size. For nucleic acids, S=4 (A,G,C,T), so a trinucleotide search would provide a 64-fold increase in speed, while a tetranucleotide search provides a 256-fold speedup. For amino acid sequences, S=20, so a search set to k=1 provides a 20-fold increase in speed, while k=2 provides a 400-fold increase.

In essence, the lookup table speeds up the search by sampling the X-axis sequence where perfect oligomer matches occur, rather than exhaustively comparing every possible window of l nucleotides between two sequences. The algorithm can be summarized thus:

ALGORITHM
Dot-matrix comparison, k=3

input: Sequences: s of length m, t of length n
const: MINPER  // minimum percentage match
       l       // window length
output: matrix a[1..m,1..n]

Maketable(TAB(x,y,z),t) // make lookup table using t

for i = 1..m   // for each nucleotide in s
     set x,y,z to central triplet in window centered at i
     for each position j listed in TAB(x,y,z)
            // compare the l-base windows centered at s[i] and t[j]
            pct = percent of the l bases that match
            if pct >= MINPER then
              a[i,j] = CharCode(pct)
              //CharCode returns character to print
              // for a given percent identity
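
The lookup-table search can be sketched in Python as below. This is a hedged illustration under stated assumptions, not the published implementation: the name lookup_dot_matrix is hypothetical, k is assumed odd, the window is taken as w bases either side of the k-mer center (so l = 2w + 1), seeds too close to either end of a sequence are simply skipped, and hits are returned as a list rather than written into a matrix.

```python
from collections import defaultdict


def lookup_dot_matrix(s, t, k=3, w=2, min_percent=60):
    """Seed on exact k-mer matches, then score the window around each seed.

    Returns (i, j, percent) using 0-based k-mer center positions i in s
    and j in t.
    """
    half = k // 2
    # lookup table: k-mer -> center positions in t
    table = defaultdict(list)
    for j in range(half, len(t) - half):
        table[t[j - half:j + half + 1]].append(j)

    l = 2 * w + 1
    hits = []
    for i in range(half, len(s) - half):
        for j in table.get(s[i - half:i + half + 1], []):
            # skip seeds whose window would run off either sequence
            if min(i, j) < w or i + w >= len(s) or j + w >= len(t):
                continue
            same = sum(s[i + d] == t[j + d] for d in range(-w, w + 1))
            percent = 100 * same // l
            if percent >= min_percent:
                hits.append((i, j, percent))
    return hits


print(lookup_dot_matrix("GGCTTGACCGG", "GGATTGACCCG"))
# [(4, 4, 80), (5, 5, 100), (6, 6, 100), (7, 7, 80)]
```

Only the positions where a trinucleotide matches exactly are ever scored, so the inner window comparison runs a handful of times here instead of once for every cell of the matrix.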

To ensure a thorough search, we must choose a combination of k value and window size such that the window l bases wide which is searched at one k-mer match will overlap the adjacent window. The average distances between k-matches for different values of p and k are given in Table 2.

Table 2. Avg. dist. between k-matches (1/p^k)

Prob. of a match (p)   k=2    k=3    k=4    k=5
0.050                  400   8000
0.075                  178   2370
0.100                  100   1000
0.150                   44    296
0.200                   25    125
0.250                   16     64    256   1024
0.300                   11     37    123    412
0.350                    8     23     67    190
0.450                    5     11     24     54
0.600                    3      5      8     13
0.700                    2      3      4      6
0.900                    1      1      1      2

For example, if the probability of a match between two DNA sequences is 0.25, we expect to see a dinucleotide match once every 16 bases in a comparison. Trinucleotides will match on the average of once every 64 bases, and so on. These matches are due to background similarity. Since regions which share significant similarity must by definition have a frequency of matches which is higher than background, similar regions will have more frequent k-mer matches, and consequently are more likely to be found.
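
The entries in Table 2 follow directly from the formula 1/p^k. A short Python sketch (the function name expected_spacing is hypothetical, and rounding to the nearest whole base is an assumption about how the table was prepared) reproduces a few of the rows:

```python
def expected_spacing(p: float, k: int) -> int:
    """Expected distance between chance k-mer matches, 1 / p**k, rounded."""
    return round(1 / p ** k)


for p in (0.25, 0.30, 0.35):
    print(p, [expected_spacing(p, k) for k in (2, 3, 4, 5)])
```

For p = 0.25 this gives 16, 64, 256 and 1024, the dinucleotide through pentanucleotide spacings quoted in the text.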

Table 2 illustrates how the overall level of similarity between two sequences affects the expected distance between k-mer matches. The knowledge of the expected frequency of k-mer matches allows us to predict the level of similarity likely to be missed. If we wish to find nucleotide similarities with 30% match or better, a triplet search (k=3) will necessitate the use of a window size l >= 19, since the average distance between triplet matches is 37. The actual choice of k and l values will depend on the purpose of the search.

3. Similarity searches can also be used to detect direct repeats and inverted repeats.

One of the strengths of dot-matrix searches is that they make repeats easy to detect by comparing a sequence against itself. In self comparisons, direct repeats appear as diagonals parallel to the main line of identity. For example, each member of the human AluI middle sequence family is itself made of two subrepeats. In the human AluI sequence p27 (GenBank K01153), the sequence from approximately 30..120 is imperfectly repeated at positions 140..260.

D3HOM         Version  5/13/91
X-axis: >HUMRSA27
Y-axis: >HUMRSA27
SIMILARITY RANGE:  15      MIN.PERCENT SIMILARITY:  50
SCALE FACTOR:    0.95      COMPRESSION:             10

              100       200

       I |      . ||      .     |   .         .
        A|      . ||      .     |   .         .
---------A--------|R      .     |   .         .
         |A     . | R     .     |   .         .
         | A    . |  TT   .     |   .         .
         |  C   . |   V Y .     |   .         .
         |   B  . |      YZ     |   .         .
         |    A . |     Y .     |   .         .
         |     A. |       .     |   .         .
   100 ..|......A.|..........Y..|..............
         |      .A|       .   U |   .         .
---------|--------A------------XZ   .         .
---------R      . |A      .         .         .
          R     . | A   Y .         .         .
           T    . |  A  VX.         .         .
           TV   . |   A   .         .         .
                . |    C Z.         .         .
            Y Y . | YV  C .         .         .
             Y  . |  X Z A.         .         .
   200 ......Z....|.......A....................
                . |       .A        .         .
                . |       . C       .         .
                Y |       .  B      .         .
                .U|       .   A     .         .
                . X       .    AZ   .         .
------------------Z       .    ZF   .         .

A more interesting example is seen in human sequence p16, which contains both an AluI family sequence and eight tandem repeats, which themselves are made up of two imperfect repeats:

Human clone p16 (GenBank K01154)

Self comparison of p16 (low resolution)

Self comparison of p16 (high resolution, 66nt repeats only)
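
The idea behind self comparison can be illustrated with exact k-mer matches. The Python sketch below is a simplification made for illustration: the function names are hypothetical, only perfect k-base repeats are reported (the dot-matrix programs above also find imperfect ones), and inverted repeats are found by looking for the reverse complement of each k-mer elsewhere in the same sequence.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A<->T, C<->G)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_starts(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer to the 0-based positions where it starts."""
    starts: dict[str, list[int]] = {}
    for j in range(len(seq) - k + 1):
        starts.setdefault(seq[j:j + k], []).append(j)
    return starts


def direct_repeats(seq: str, k: int) -> list[tuple[int, int]]:
    """Pairs (i, j), i < j, where the same k-mer starts at both positions."""
    starts = kmer_starts(seq, k)
    return sorted((i, j) for pos in starts.values() for i in pos for j in pos if i < j)


def inverted_repeats(seq: str, k: int) -> list[tuple[int, int]]:
    """Pairs (i, j), i < j, where seq[j:j+k] is the reverse complement of seq[i:i+k]."""
    starts = kmer_starts(seq, k)
    pairs = []
    for i in range(len(seq) - k + 1):
        rc = reverse_complement(seq[i:i + k])
        pairs.extend((i, j) for j in starts.get(rc, []) if j > i)
    return pairs


print(direct_repeats("GATTACAGGATTACA", 7))    # [(0, 8)]
print(inverted_repeats("GGGAACTTTGTTCCC", 6))  # [(0, 9)]
```

In a self-comparison matrix, each direct-repeat pair (i, j) corresponds to a cell off the main line of identity, which is why runs of such pairs show up as parallel diagonals.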

Unless otherwise cited or referenced, all content on this page is licensed under the Creative Commons License Attribution Share-Alike 2.5 Canada



Sept. 26, October 1, 2024
