Guaranteed alignments with LAST
===============================

This document describes how to guarantee finding alignments with a
limited number of mismatches or gaps.

** This is mostly of theoretical interest!  In most cases, you are **
** advised to use the methods in last-cookbook.rst and documents   **
** referred to thence.  You should probably ignore this document!  **

The first step of lastal is to find initial matches:

  * For each possible start position in the query: find the shortest
    match with length >= l that EITHER occurs <= m times in the
    reference, OR has length L.

By default, m=10 and L=infinity.  This typically works well, but it
does not guarantee to find alignments.  For example, if a short query
sequence perfectly matches more than ten locations in the reference,
no alignments will be found.

1. Counting exact matches
-------------------------

This example counts all exact matches of length >= 30:

  lastdb -m1 genomedb genome.fa
  lastal -l30 -j0 genomedb reads.fa

The -m1 option ensures that the initial matches are exact matches.
The -l30 option requests initial matches of length >= 30.  Finally,
the -j0 option tells lastal to simply count initial matches and not
extend alignments.  When -j0 is used, lastal applies no restriction on
how often the matches occur: it counts all matches of length >= 30.

2. Guaranteeing to align reads with up to N mismatches
------------------------------------------------------

Suppose we wish to find all alignments of length-36 reads to the
genome, allowing up to two mismatches.  A naive approach is to start
by finding all exact matches of length 12, and extend alignments from
these.  This works because any length-36 read with two mismatches is
guaranteed to have an exact match of length 12.  It will be very slow,
however, because there will be many unproductive length-12 matches.

We can do better by finding matches using a spaced seed, and then
extending alignments.  For example, our reads are guaranteed to have a
match using this spaced seed pattern: 11111011000111110110001111.
Since this seed has 18 matched positions (18 "1"s), we will get far
fewer unproductive matches.  With LAST, we can do this as follows:

  lastdb -m11111011000 mydb genome.fa
  lastal -L26 -m0 -j1 -q0 -d34 -n100 mydb reads.fa

In the lastdb command, the seed pattern gets cyclically repeated, so
we only need to specify the repeating unit of the pattern.  In the
lastal command, we used -L26 to get length-26 initial matches, and -m0
to suppress shorter initial matches.  We also used -j1 to request
gapless alignments, -q0 to set the mismatch cost to 0, and -d34 to
request alignments with score >= 34.  Finally, -n100 limits the
output: if one read has many matches, an arbitrary sample (at least
100) of them will be returned.

The following table shows optimal spaced seed patterns for various
read sizes and numbers of mismatches.  Each entry shows the match
length (e.g. 26) and the pattern (e.g. 11111011000).

====  ===========  ================  ==================  ======================
Read  1 mismatch   2 mismatches      3 mismatches        4 mismatches
size
====  ===========  ================  ==================  ======================
16    12 11110     10 1110100         9 11010000          3 1110
17    13 11110     10 1110100        10 11010000          5 1110
18    14 11110     12 1110100        10 11010000          5 1110
19    14 11110     12 1110100        12 11010000          5 1110
20    16 11110     12 1110100        12 11010000         11 1100010000
21    17 11110     15 1110100        12 11010000         12 1100010000
22    18 11110     16 1110100        13 1110100000       12 1100010000
23    19 11110     17 1110100        13 11101001000      12 1100010000
24    19 111110    17 1110100        14 11101001000      12 1100010000
25    20 111110    19 1110100        14 11101001000      16 1100010000
26    21 111110    19 1110100        16 11101001000      16 1100010000
27    22 111110    19 1110100        16 11101001000      15 1110100000000
28    23 111110    22 1110100        16 11101001000      16 1110100000000
29    23 111110    23 1110100        19 11101001000      16 1110100000000
30    25 111110    24 1110100        19 11101001000      18 1110100000000
31    26 111110    24 1110100        19 1110110100000    18 1110100000000
32    27 111110    26 1110100        18 111101011001000  18 111010010000000
33    28 111110    26 1110100        19 111101011001000  18 111010010000000
34    29 111110    22 1111101110010  19 111101011001000  20 111010010000000
35    29 1111110   25 11111011000    21 111101011001000  20 111010010000000
36    30 1111110   26 11111011000    21 111101011001000  20 11110010000001000
37    31 1111110   27 11111011000    23 111101011001000  21 11110010000001000
38    32 1111110   27 11111011000    24 111101011001000  21 11110010000001000
39    33 1111110   29 11111011000    24 111101011001000  21 11110010000001000
40    34 1111110   30 11111011000    24 111101011001000  24 11110010000001000
41    34 1111110   30 11111011000    27 111101011001000  24 11110010000001000
42    36 1111110   30 1111110101100  27 111101011001000  24 1101110100000010000
43    37 1111110   31 1111110101100  27 111101011001000  25 1101110100000010000
44    38 1111110   32 1111110101100  32 1110110100000    25 1101110100000010000
45    39 1111110   32 1111110101100  31 111101011001000  27 1101110100000010000
46    40 1111110   34 1111110101100  32 111101011001000  28 1101001110100000000
47    41 1111110   35 1111101110010  33 111101011001000  28 1101001110100000000
48    41 11111110  35 1111101110010  34 111101011001000  30 1101001110100000000
49    42 11111110  37 1111110101100  34 111101011001000  30 1101001110100000000
50    43 11111110     ?              36 111101011001000  30 1101001110100000000
====  ===========  ================  ==================  ======================

This table was made using software kindly provided by the authors of
these publications:

* G Kucherov, L Noé, M Roytberg (2005) IEEE/ACM Trans Comput Biol
  Bioinform 2:51-61.
* S Burkhardt, J Kärkkäinen (2003) Fundamenta Informaticae 56:51-70.

For longer reads, it becomes harder to determine the optimal patterns.

3. Guaranteeing to align reads with up to N mismatches or indels
----------------------------------------------------------------

LAST has no clever way to do this, only brute force.  Any alignment of
a length-L read, with N mismatches or insertions or deletions, must
include an exact match of this length:

  int[ L / (N+1) ]

Here, "int" means round fractions down.  For example, if L=72 and N=2,
there must be an exact match of length 24.  We can find the alignments
like this:

  lastdb -m1 genomedb genome.fa
  lastal -L24 -m0 -d0 -y0 -a0 -x2 -e68 genomedb reads.fa

The lastdb -m1 option requests exact initial matches, the -L24 sets
their length to 24, and -m0 suppresses shorter initial matches.  The
-e68 sets the alignment score threshold to 68, which is the minimum
score for a size-72 read with 2 differences.  The -x2 just makes it
faster, by quitting alignments early if it finds more than 2
differences.