BIRCH

TUTORIAL: PAIRWISE SIMILARITY SEARCHES

update Oct. 2, 2014


FASTA documentation $doc/fasta/fasta36.1.html


Example: Antifreeze Proteins

It is sometimes that case that a term referring to a class of proteins by function actually refers to a diverse set of proteins with several independent evolutionary origins. The antifreeze proteins are a case in point. The term 'antifreeze protein' has been used to refer to a number of completely different protein families. While these proteins all play a role in protecting cells from the formation of ice crystals, they are not all related. Most of these proteins tend to be short, some as short as 16 amino acids, and most of them less than 90 amino acids.

One characteristic shared by most of the antifreeze proteins is that they tend to be very rich in Alanine, in some cases as much as half to two thirds of the residues. This presents a special challenge in evaluating whether two proteins are homologous. Is the degree of similarity we see due to similarity in amino acid composition, or can the similarity be taken as evidence of common ancestry for the two proteins being compared?

This tutorial will illustrate some approaches to evaluating sequence similarity between pairs of very closely-related sequences, distantly-related sequences, and between unrelated sequences with similar amino acid compositions.

The dataset found in the file uniprot-antifreeze-keyword-47.txt. This file was retrieved from the UniProt Database by searching for sequences containing the keyword 'antifreeze', and limiting the results to Molecular Function - Antifreeze protein.

1. Read entries into blprotein

Since this tutorial covers pairwise similarity, create a directory called 'pairwise' and save  uniprot-antifreeze-keyword-47.txt to that directory. cd into the pairwise directory, and start blprotein by typing 'blprotein'.



UniProt textfiles, containing complete annotation, cannot be read directly by blprotein. However, the file format is almost identical to that used by the EMBL database. The blprotein Import Foreign Format function uses readseq to interconvert file formats. Therefore, to read this file choose File --> Import Foreign Format, and type in the name of the file. 


 

2. Comparisons between two closely-related sequences

To illustrate a pairwise comparison between two closely-related sequences, select two antifreeze proteins, ANP1C_MACAM and ANP2_ANALU, . Before running any of the Smith-Waterman type alignment programs, it's worth doing a dot-matrix plot of these proteins using Similarity --> PXHOM, with COMPRESSION=4  and MIN. PERCENT SIMILARITY=30. Note: PXHOM is the protein counterpart of DXHOM.

P1HOM         Version  5/13/91
X-axis: ANP1C_MACAM
Y-axis: ANP2_ANALU
SIMILARITY RANGE: 10 MIN.PERCENT SIMILARITY: 30
SCALE FACTOR: 0.90 COMPRESSION: 4

40 80

U . . . . . . .
H . . . . . . .
F . . . . . . .
I . . . . . . .
P . . . . . . .
P . . . . . . .
OH . . . . . . .
FC . . . . . . .
CE. . . . . . .
40 ........GI............................................................
.N j . . . . .
.PP . . . . . .
. OK . . . . . .
. JG . . . . . .
. EF . . . . . .
. FE . . . . . .
. JK . . . . . .
. M . . . . . .
. L. . . . . .
80 ..................ML..................................................
. NP . . . . .


The output shows that both proteins show similarity above background along their entire length.

GGSEARCH performs rigorous global alignments. That is, the alignment is constructed so that all of both proteins must be included in the alignment. Choose Similarity --> GGSEARCH to get the GGSEARCH menu:

Click Run. Output is shown below:


The best scores are:                                      n-w bits E(1)
ANP2_ANALU 88 bp ( 88) 455 72.0 4.2e-218

>>ANP2_ANALU 88 bp (88 aa)
n-w opt: 455 Z-score: 365.0 bits: 72.0 E(): 4.2e-218
global/global (N-W) score: 455; 84.1% identity (92.0% similar) in 88 aa overlap (1-87:1-88)

10 20 30 40 50
ANP1C_ MKSVILTGLLFVLLCVDHMT-ASQSVVATQLIPINTALTPAMMEGKVTNPIGIPFAEMSQ
:::.:::::::::::::::. ::::::::::::::::::: ::.:.:.:: :::::::::
ANP2_A MKSAILTGLLFVLLCVDHMSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
10 20 30 40 50 60

60 70 80
ANP1C_ IVGKQVNTPVAKGQTLMPNMVKTYVAGK
::::::: ::: .:::::::::: :.:
ANP2_A IVGKQVNRAVAKDETLMPNMVKTYRAAK
70 80

When similarity is this high, it is easier to view the mismatches, rather than the matches:


 

The best scores are:                                      n-w bits E(1)
ANP2_ANALU 88 bp ( 88) 455 74.0 1.1e-232

>>ANP2_ANALU 88 bp (88 aa)
n-w opt: 455 Z-score: 375.5 bits: 74.0 E(): 1.1e-232
global/global (N-W) score: 455; 84.1% identity (92.0% similar) in 88 aa overlap (1-87:1-88)

10 20 30 40 50
ANP1C_ MKSVILTGLLFVLLCVDHMT-ASQSVVATQLIPINTALTPAMMEGKVTNPIGIPFAEMSQ
x x X x x x X
ANP2_A MKSAILTGLLFVLLCVDHMSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
10 20 30 40 50 60

60 70 80
ANP1C_ IVGKQVNTPVAKGQTLMPNMVKTYVAGK
XX Xx X x
ANP2_A IVGKQVNRAVAKDETLMPNMVKTYRAAK
70 80

3. Comparisons between two distantly-related sequences

Antifreeze proteins ANP11_MACAM  ANP3_MACAM, both from Macrozoarces americanus (ocean pout) are more highly-diverged.

Try it:
Select ANP11_MACAM, ANP3_MACAM
Similarity --> PXHOM
COMPRESSION: 2

P1HOM         Version  5/13/91
X-axis: ANP11_MACAM
Y-axis: ANP3_MACAM
SIMILARITY RANGE: 10 MIN.PERCENT SIMILARITY: 30
SCALE FACTOR: 0.90 COMPRESSION: 2

20 40 60

. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
20 ......................................................................
. . . . . . .
. . . . . . .
U . . . . . . .
O . . . . . . .
K . . . . . . .
I . . . . . . .
H . . . . . . .
I . . . . . . .
L . . . . . . .
40 .......Q..............................................................
V. . . . . . .
W . . . . . .
. . . . . . .
. c . . . . . .
. f . . . . . .
. e . . . . . .
. f . . . . . .
. . . . . . .
. . . . . . .
60 ......................................................................
. g. . . . . .
. . . . . . .
. .d . . . . .
. . c . . . . .
. . Z . . . . .
h . . . . . . .
i . . W . . . . .
h . . T . . . . .
i . . V . . . . .
80 ...........................X..........................................
. . Z. . . . .
. . f . . . .
. . . . . . .
. . . . . . .
. . . . . . .



Despite the fact that both sequences are from the same species, these proteins have diverged substantially. We can infer that these two copies of the gene must have diverged from a single ancestral copy over a long period of time, and both copies have been retained until modern times.

Surprisingly, comparison of the two sequences using GGSEARCH shows no significant similarity.

Try it:
Select ANP11_MACAM, ANP3_MACAM
Similarity --> GGSEARCH
GGSEARCH produces global alignments
version 35.04 Oct. 7, 2008
Query: bio8564379525050104483.tmp.seq1
1>>>ANP11_MACAM 62 bp - 62 aa
Library: bio8564379525050104483.tmp.seq2 91 residues in 1 sequences

0 residues in 0 sequences (range: 50-77)
Statistics: Altschul/Gish params: n0: 62 Lambda: 0.158 K: 0.019 H: 0.100
Algorithm: Global/Global Needleman-Wunsch (2007) (6.0 April 2007)
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
Scan time: 0.000
!! No sequences with E() < 0.010000


62 residues in 1 query sequences
91 residues in 1 library sequences
Scomplib [35.04]
start: Mon Sep 28 12:36:59 2009 done: Mon Sep 28 12:36:59 2009
Total Scan time: 0.000 Total Display time: 0.010


Yet, we know from the PXHOM output that there is a long region of similarity. In particular, PXHOM shows that the first 22 amino acids of ANP11_MACAM show a strong match with residues 22 through 44 of ANP3_MACAM. This illustrates an important distinction between global and local alignments. GGSEARCH performs alignments in which the entire length of both sequences must be forced into an alignment. GLSEARCH finds the highest-scoring local alignment between both sequences.

Try it:
Select ANP11_MACAM, ANP3_MACAM
Similarity --> GLSEARCH

GLSEARCH produces global query alignments
version 35.04 Oct. 7, 2008
Query: bio3874515111645596706.tmp.seq1
1>>>ANP11_MACAM 62 bp - 62 aa
Library: bio3874515111645596706.tmp.seq2 91 residues in 1 sequences

91 residues in 1 sequences (range: >50)
Statistics: (shuffled [500]) Unscaled normal statistics: mu= 14.4260 var=94.9384 Ztrim: 0
Algorithm: Global/Local Needleman-Wunsch (2007) (6.0 April 2007)
Parameters: BL50 matrix (15:-5), open/ext: -12/-2
Scan time: 0.050

The best scores are: n-w bits E(1)
ANP3_MACAM 91 bp ( 91) 211 50.6 8.2e-91

>>ANP3_MACAM 91 bp (91 aa)
n-w opt: 211 Z-score: 251.7 bits: 50.6 E(): 8.2e-91
global/local score: 211; 58.1% identity (72.6% similar) in 62 aa overlap (1-62:25-86)

10 20 30
ANP11_ SVVATQLIPINTALTPAMMEGKVTNPIGIPFAEMSQ
::::::::::::::: .:: .: : ::: .. .
ANP3_M MKSVILTGLLFVLLCVDHMSSANQSVVATQLIPINTALTLVMMTTRVIYPTGIPAEDIPR
10 20 30 40 50 60

40 50 60
ANP11_ IVGKQVNRIVAKGQTLMPNMVKTYAA
.:. :::. : : ::::.::: :
ANP3_M LVSMQVNQAVPMGTTLMPDMVKFYCLCAPKN
70 80 90


This example illustrates that forcing all of both sequences into the alignment effectively dilutes out the shorter regions of strong similarity by including regions of weak similarity. In this case, the reduced score was below the threshold for detection. A global search would have erroneously concluded that no similarity was found, even though the dot-matrix search clearly showed detectible similarity along most of the length of the sequences.

One point which may can sometimes be important is the estimate of statistical significance of an alignment.  The preferred application for evaluation of significance of sequence similarity is SSEARCH.

Try it:
Select ANP11_MACAM, ANP3_MACAM
Similarity --> SSEARCH

SSEARCH searches a sequence data bank
version 35.04 Oct. 7, 2008
Please cite:
T. F. Smith and M. S. Waterman, (1981) J. Mol. Biol. 147:195-197;
W.R. Pearson (1991) Genomics 11:635-650

Query: bio6278841360627469717.tmp.seq1
1>>>ANP11_MACAM 62 bp - 62 aa
Library: bio6278841360627469717.tmp.seq2 91 residues in 1 sequences

91 residues in 1 sequences
Statistics: (shuffled [500]) MLE statistics: Lambda= 0.2059; K= 0.148
Algorithm: Smith-Waterman (PGopt) (6.0 Mar 2007)
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
Scan time: 0.070

The best scores are: s-w bits E(1)
ANP3_MACAM 91 bp ( 91) 214 66.3 6.1e-17

>>ANP3_MACAM 91 bp (91 aa)
s-w opt: 214 Z-score: 336.6 bits: 66.3 E(): 6.1e-17
Smith-Waterman score: 214; 60.0% identity (75.0% similar) in 60 aa overlap (1-60:25-84)

10 20 30
ANP11_ SVVATQLIPINTALTPAMMEGKVTNPIGIPFAEMSQ
::::::::::::::: .:: .: : ::: .. .
ANP3_M MKSVILTGLLFVLLCVDHMSSANQSVVATQLIPINTALTLVMMTTRVIYPTGIPAEDIPR
10 20 30 40 50 60

40 50 60
ANP11_ IVGKQVNRIVAKGQTLMPNMVKTYAA
.:. :::. : : ::::.::: :
ANP3_M LVSMQVNQAVPMGTTLMPDMVKFYCLCAPKN
70 80 90

Whereas GGSEARCH and GLSEARCH use a shortcut method for finding similarities rapidly, SSEARCH does an exhaustive Smith-Waterman alignment, which requires longer processing time. In this case, the alignments are identical. However, the E-value calculated by GLSEARCH of 8.2 x 10-91 appears to be an overestimate of the significance of the alignment. The more rigorous SSEARCH calculates an E-value of only 6.1x10-17.  While this is still highly significant, in other cases, a difference in E-value of 74 orders of magnitude could mean the difference between detecting a similarity or not detecting it.

4. Comparison of sequences with skewed amino acid compositions

Not all similarities will be as straightforward to interpret as those shown above.  Inspection of the sequences in the blprotein window show that many of the antifreeze proteins contain Alanine-rich regions. In particular, ANP_NOTCO from  Notothenia coriiceps neglecta (Black rockcod) appears by eye to contain many internal repeats of an Ala-rich subunit. The shorter ANP4_PSEAM from Pseudopleuronectes americanus (Winter flounder) also has Ala-rich repeats. Are the two proteins related to a single commen ancestral protein, or is this a case of convergent evolution creating two Ala-rich proteins independently?

In this case, a dot-matrix comparison of the two proteins is typical for cases in which many tandem repeats are found.

Try it:
Select ANP4_PSEAM, ANP_NOTCO
COMPRESSION: 2


An excerpt is shown below:
P1HOM         Version  5/13/91
X-axis: ANP4_PSEAM
Y-axis: ANP_NOTCO
SIMILARITY RANGE:  10      MIN.PERCENT SIMILARITY:  30
SCALE FACTOR:    0.90      COMPRESSION:              2

               20        40        60        80

                . h    e f.  ef   f h   e    j.         .         .         .
             i hj g dZcbbdaZbcbdZcbdgghicbfbhgi         .         .         .
             igeg   eZUgYbZcUYhbZUiYdjfdf ZbYhfi        .         .         .
               e. e cXeRXWbXTRWbXeTXWdfiaeZZeXf j       .         .         .
             i fc e cWTSRWWWSRRWWTSSWWcbZZaYYYXgh       .         .         .
               hg g dbVRRRZYVRRSYVRRTgYdafaZXYZbj       .         .         .
             j hf d cZfSQQSYTSQSZYTQQVYfaZZZcXahc       .         .         .
              gf.dg afdXQdQWXdQQeWZSgSZefYYdbcaig       .         .         .
                d b aaeVVPeSZVPPSgZZR UiafYYfced        .         .         .
    20 .....i...d.h.ZgaeVePSeeVPfSgaeTVedaYdYffhf............................
The entire matrix is full of diagonals, because Ala-rich subsequences will always show similarity with other Ala-rich sequences.

To get around this problem, the FASTA programs GGSEARCH, GLSEARCH, SSEARCH and LALIGN all shufffle one of the sequences numerous times and repeat the search.

Rather than shuffling the entire sequence as a single unit, these programs shuffle amino acids or nucleotides in a small sliding window. Shuffling begins at one end and moves down the length of the sequence. What this accomplishes is to create a random sequence which preserves local fluctuations in sequence composition that were found in the original sequence.

Local shuffling controls for the hypothesis that the two sequences show similarity because the each have similar local variations in amino acid compositions eg. in DNA, both sequences have an AT-rich followed by a GC-rich region.  If the program finds significant similarity between the two unshufffled sequences, but not between one original and the population of shuffled sequences, we can reject the hypothesis that similarity is due to local variations in amino acid or nucleotide composition.

By default, the FASTA programs generate 500 randomized sequences, and calculate the E-value based on similarity scores obtained with the population of randomized sequences. The more shuffles that are done, the more accurate the E-value will be.

Try it:
Select ANP4_PSEAM, ANP_NOTCO
Similarity --> SSEARCH

SSEARCH searches a sequence data bank
version 35.04 Oct. 7, 2008
Please cite:
T. F. Smith and M. S. Waterman, (1981) J. Mol. Biol. 147:195-197;
W.R. Pearson (1991) Genomics 11:635-650

Query: bio7603365334708550294.tmp.seq1
1>>>ANP4_PSEAM 85 bp - 85 aa
Library: bio7603365334708550294.tmp.seq2 790 residues in 1 sequences

790 residues in 1 sequences
Statistics: (shuffled [500]) MLE statistics: Lambda= 0.1216; K=1.247e+05
Algorithm: Smith-Waterman (PGopt) (6.0 Mar 2007)
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
Scan time: 0.800
!! No sequences with E() < 0.010000


85 residues in 1 query sequences
790 residues in 1 library sequences

We can conclude that these two sequences are analogous, resulting from convergent evolution, rather than homologous, and descended from a common ancestor.