Oct. 20, 2015
TCOFFEE documentation:
Tutorial - $doc/tcoffee/t_coffee_tutorial.html
Manual - $doc/tcoffee/t_coffee_technical.html
Paper - $doc/tcoffee/t_coffee.pdf

DIALIGN Web site
DIALIGN-TX Help Page $doc/dialign/
MRTRANS documentation: $doc/fasta/mrtrans.txt
Jalview Manual $doc/jalview/contents.html

MAFFT documentation:
MAFFT Web Site -
Manual - $doc/mafft/Manpage_of_MAFFT.html
Paper - Katoh K, Standley DM (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30:772-780 doi: 10.1093/molbev/mst010

Example: Plant Defensins

A test dataset of GenBank entries for plant antifungal proteins called defensins can be found in defensin.gen. Using the FEATURES program, as described in previous tutorials, the protein coding sequences (CDS) were extracted from this file and stored in defensin.CDS.fsa. The amino acid sequences were then obtained by selecting all the CDS in defensin.CDS.fsa and running DNARNA --> Ribosome. The result was saved as defensin.protein.fsa.
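To make the translation step concrete, here is a minimal Python sketch of what translating a CDS involves. It is not the Ribosome program itself: the function name translate_cds and the example sequence are made up for illustration, and only the standard genetic code is assumed.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_cds(dna):
    """Translate a coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration only.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        # Codons containing ambiguity codes (eg. N) are reported as X.
        aa = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Invented example sequence: ATG GCT AAG TGC TAA
print(translate_cds("ATGGCTAAGTGCTAA"))
```

The sketch assumes the CDS starts in frame at the first base, which is what a CDS feature extracted from a GenBank entry normally provides.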
Warning! Most multiple alignment programs will align either DNA or amino acid sequences. However, it is important to know that unless nucleic acid sequences are very closely related, with few gaps (eg. tRNA, rRNA genes), a reliable multiple alignment (or for that matter, even a pairwise alignment) is almost impossible. The reason is that nucleic acids use a 4-letter alphabet, allowing many equally good alignments for a given set of sequences. In contrast, the 20-letter amino acid alphabet drastically decreases the number of plausible alignments, so that an obvious 'correct' alignment is usually possible to find.
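A rough way to see the effect of alphabet size is to ask how often two unrelated letters match by chance. The Python sketch below assumes all letters are equally frequent, which is a simplification (real sequences have biased composition); the function name chance_identity is invented for this example.

```python
def chance_identity(frequencies):
    """Probability that two letters drawn independently are identical."""
    return sum(f * f for f in frequencies)

# Assumption: uniform letter frequencies in both alphabets.
dna = chance_identity([1 / 4] * 4)
protein = chance_identity([1 / 20] * 20)
print(f"DNA: {dna:.2f}  protein: {protein:.2f}")
```

Under this assumption about a quarter of the positions in any DNA alignment match by chance, against about one in twenty for proteins, so spurious DNA alignments are much harder to tell apart from real ones.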

1. TCOFFEE - Global multiple alignment by clustering

Launch blprotein and run File--> Open to read in defensin.protein.fsa.


Select all amino acid sequences and open the Alignment menu, which shows programs related to multiple sequence alignment.

Choose TCOFFEE, which brings up the menu

TCOFFEE sends the alignment to a new blpalign window.

Choose File --> Save ALL As and save the t_coffee protein alignment as

Additional output files:

Warning! Do not use guide trees generated by clustal or other multiple alignment programs for any other purpose, eg. phylogenetic analysis of sequence or species evolution. These trees are based on pairwise alignments, and therefore do not contain the evolutionary information found in the gaps that are present in the completed alignment. Once you have an alignment, you can then go back and construct a phylogenetic tree from it.

Note on TCOFFEE - To speed up the alignment, TCOFFEE will break up the problem into many parts, allocating them to new tcoffee jobs. You can see these jobs using the Unix top or ps commands. Initially, there will be many tcoffee jobs. As the profiles for sub-alignments are merged, the number of tcoffee jobs will decrease, until the final alignment is completed in a single tcoffee job.

2. DIALIGN-TX - Segment-based local alignment

Select all sequences as above, and choose Alignment --> DIALIGN-TX. One of the main points of DIALIGN-TX is that normally, one should not have to set any parameters. In fact, although a number of parameters can be set in this program, it is often dangerous to change them. In no case should you change the parameters without carefully reading the publications on DIALIGN. So, in most cases, simply click on 'Run' and run the program.

While the alignments differ, it is hard to say whether one is "better" than the other.  Both programs align the motifs containing the Cys residues. While TCOFFEE tends to insert leading gaps which prevent the N-terminal Met residues from aligning, DIALIGN-TX  does not insert any gaps in the N-terminus. The order of sequences in the final alignment is different, seeming to imply a disagreement on the presumed phylogenetic relationships of the sequences. However, Neighbor Joining trees constructed from these alignments are identical (not shown).
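Differences such as leading gaps are easy to check outside the editor if the alignments are saved in Pearson/FASTA format. The following Python sketch counts the gap characters at the start of each aligned sequence. The function names and the toy data are invented for illustration, and '-' is assumed to be the gap character.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def leading_gaps(aligned):
    """Number of gap characters before the first residue of each sequence."""
    return {name: len(seq) - len(seq.lstrip("-"))
            for name, seq in aligned.items()}

# Toy alignment, not taken from the defensin dataset.
toy = ">seq1\n--MAKC\n>seq2\nMA-KCS\n"
print(leading_gaps(read_fasta(toy)))
```

Running the same function on the TCOFFEE and DIALIGN-TX alignments would show which sequences have had their N-terminal Met pushed out of line.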

3. MAFFT - Multiple Alignment by Fast Fourier Transformation

MAFFT is a program that implements a wide variety of alignment strategies.

One of the unique things to note about MAFFT is that it will automatically choose the best algorithm for alignment, based on the number of input sequences, unless a method is specified. The methods cover most of the situations typically encountered in multiple alignments.

Because the type of problem will vary, it is best to consult the MAFFT Algorithms page to see which algorithm best applies to your specific dataset. For relatively small numbers of sequences (eg. < 200 sequences), methods such as E-INS-i, L-INS-i and G-INS-i construct a guide tree based on pairwise distances and then traverse the tree until all sequences have been added, similar to TCOFFEE. The faster FFT-NS-1 and FFT-NS-2 estimate between-sequence distances based on the frequencies of shared 6-mers (short words of six residues). Methods also vary depending on whether or not they refine the initial alignment by iteratively aligning subsets of aligned sequences, and by whether they recalculate the guide tree based on the first alignment and then repeat the alignment. Obviously, for small numbers of sequences, the slower, more accurate methods are preferred.
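The idea behind a 6-mer distance can be sketched in a few lines of Python. This is a deliberately simplified illustration and not MAFFT's actual scoring: the formula (one minus the fraction of shared 6-mers), the function names and the example sequences are all assumptions made for this sketch.

```python
from collections import Counter

def kmer_counts(seq, k=6):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=6):
    """1 - (shared k-mers / number of k-mers in the shorter sequence)."""
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / possible

# Invented example sequences differing at a single position.
print(kmer_distance("MAKCSSDPLCQK", "MAKCSSDPLCRK"))
```

No alignment is needed to compute such a distance, which is why methods of this kind scale to large numbers of sequences.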

To launch MAFFT, choose Alignment --> MAFFT. Output is sent directly to blpalign. Try comparing the output from the FFT-NS-2 and FFT-NS-i methods, which essentially compares a fast progressive method with a slower iterative method.



In this trivial example, the alignments are done almost instantly, and are identical.

4. MRTRANS - Alignment of DNA sequences, using protein alignment information

As mentioned previously, alignments using DNA sequences are unreliable. However, there is a very good reason to want aligned DNA sequences, rather than proteins. Due to the degeneracy of the genetic code, mutations at the DNA level often do not result in amino acid substitutions. These so-called silent mutations contain a great deal of additional evolutionary information that is lost in amino acid sequences. Consequently, it would be better to do phylogenetic analysis on DNA alignments, if such alignments could be done reliably.

MRTRANS by Bill Pearson aligns DNA sequences using the corresponding amino acid alignment as a guide. Thus, if you have aligned a set of amino acid sequences, it is straightforward to generate the corresponding DNA alignment. MRTRANS requires two input files in Pearson/FASTA format: one containing the unaligned DNA coding sequences, and a second containing the corresponding amino acid sequences, aligned by a program such as TCOFFEE, MAFFT or DIALIGN-TX. When we run MRTRANS from blpalign, blpalign will automatically generate a protein alignment file from the proteins selected. Thus, all the user needs to do is to select a file containing the corresponding unaligned DNA CDS sequences. The following example illustrates this process.
1) MRTRANS needs the DNA and aligned amino acid sequences to have the same names. The names may be modified during extraction or translation, so it may be necessary to change names in 'File --> Get Info' before you export to a .wrp file.
2) Where two or more copies of a gene are present in a single entry (eg. CAGTHIOGN:CDS1 and CAGTHIOGN:CDS2), it is necessary to give them each unique names so that MRTRANS can distinguish them. Since the CDS extensions will be removed when MRTRANS is run, one solution is to delete CDS but retain the number (eg. CAGTHIOGN_1 and CAGTHIOGN_2).
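The principle of aligning DNA from a protein alignment can be shown with a short Python sketch: each residue in the aligned protein consumes one codon from the CDS, and each protein gap becomes three gap characters. This only illustrates the idea and is not the MRTRANS code; the function name thread_dna and the example sequences are invented.

```python
def thread_dna(aligned_protein, cds, gap="-"):
    """Lay the codons of a CDS onto an aligned protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # The CDS may carry a trailing stop codon that the protein lacks.
    if len(codons) not in (n_residues, n_residues + 1):
        raise ValueError("CDS and protein lengths do not correspond")
    aligned_dna = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == gap:
            aligned_dna.append(gap * 3)
        else:
            aligned_dna.append(codons[next_codon])
            next_codon += 1
    return "".join(aligned_dna)

# Invented example: protein MAKC aligned as MA-KC.
print(thread_dna("MA-KC", "ATGGCTAAGTGCTAA"))
```

Because gaps are only ever inserted between codons, the reading frame of every sequence is preserved in the DNA alignment.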

The current BioLegato implementations of TCOFFEE and DIALIGN-TX usually handle these steps automatically. Nonetheless, if you are having problems with MRTRANS, make sure that the names of sequences are the same for both protein and nucleic acid sequences.
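If MRTRANS fails, a quick name check on the two input files can save time. The Python sketch below applies the renaming suggested above (CAGTHIOGN:CDS1 becomes CAGTHIOGN_1) and lists names found in only one of the two sets. The function names are hypothetical, and the renaming rule is an assumption based on this tutorial rather than on what BioLegato does internally.

```python
import re

def normalize_name(name):
    """Replace a trailing ':CDS<n>' with '_<n>'; leave other names alone."""
    return re.sub(r":CDS(\d+)$", r"_\1", name)

def name_mismatches(protein_names, dna_names):
    """Return (names only in protein set, names only in DNA set)."""
    prot = {normalize_name(n) for n in protein_names}
    dna = {normalize_name(n) for n in dna_names}
    return sorted(prot - dna), sorted(dna - prot)

print(normalize_name("CAGTHIOGN:CDS1"))
print(name_mismatches(["CAGTHIOGN:CDS1", "GMU12150"],
                      ["CAGTHIOGN_1", "ZMA133530"]))
```

Two empty lists mean every protein has a DNA sequence of the same name, and vice versa.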

Continuing with our earlier example, if you no longer have the T_COFFEE alignment in a blpalign window, read the saved alignment back into blpalign. Then choose Edit --> Select All.

To run MRTRANS, choose Alignment --> MRTRANS, and choose the DNA filename.

This is a very easy step to mess up. Make sure to choose the correct file of unaligned DNA coding sequences.

The aligned DNA sequences appear in a new blnalign window:

5. Manual refinement of multiple alignments.

Multiple alignment programs aren't perfect, and are not guaranteed to create the optimal alignment.  As well, they cannot utilize knowledge other than sequence data. Therefore, it's always a good idea to inspect a multiple alignment, and edit the alignment before using it in a phylogeny. One common artifact that often occurs is the creation of gaps which span all sequences. Such gap positions are meaningless, and should be edited out. Knowledge of protein domain structure, or other biological knowledge, may also suggest modifications that should be made to an alignment.
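Columns that contain only gaps can also be found and removed programmatically. The following Python sketch assumes the alignment is held as a list of equal-length strings with '-' as the gap character. The function name is invented for this example.

```python
def drop_all_gap_columns(seqs, gap="-"):
    """Remove every column in which all sequences have a gap."""
    if not seqs:
        return []
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    keep = [i for i in range(len(seqs[0]))
            if any(s[i] != gap for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]

# Toy alignment: columns 3 and 4 are gaps in every sequence.
print(drop_all_gap_columns(["MA--K", "M---K", "MC--K"]))
```

Columns in which at least one sequence has a residue are left untouched, so the relative alignment of the remaining positions does not change.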

Alignments can be edited directly in BioLegato. By default, only gaps can be edited out, although it is possible to delete amino acids or nucleotides if protections are changed (Edit --> GetInfo).

Sometimes, it is useful to set several sequences to act as a group.
For example, at position 40 in the TCOFFEE alignment, most sequences have either Alanine (A) or Serine (S), shown highlighted at right. TCOFFEE has chosen to insert a big gap so that this block aligns with an A in the bottom sequence, ZMA133530. However, it would probably be just as valid if the gap had been inserted before this block, rather than after, so that the block would align with the Serine in ZMA133530. We can move the block over as follows.

First select all of the sequences by name. Click on the topmost sequence, hold the shift key, and click on GMU12150. This will highlight all but ZMA133530. We can make these sequences function as a group by choosing Edit --> Group. The blpalign window will now look like this:

The sequences that we have selected all have a '1' at the left of the name field. This indicates that they are all members of a group labeled 1. Any edit done on any sequence in the group will now take effect on all sequences in the group.

Now, insert gap characters to shift these amino acids to line up with the S. This can be done by clicking before any of the amino acids in group 1 (eg. the E at position 40, row 1) and pressing the dash (-) key until they line up with the S.

Finally, delete the original gap characters using the Backspace key or the Delete key.

See the online Help in BioLegato for a more in-depth description of how to edit alignments.

6. Displaying and printing multiple alignments

a) REFORM - Textual display

In many cases you need an alignment displayed as editable text. This might be true if you wanted to be able to import the alignment into a word-processor or HTML editor for further modification, such as coloring or underlining certain characters.  Choosing Alignment --> REFORM will print out an alignment in which amino acids matching the consensus are indicated by dots:

10 20 30 40 50 60 70

Various features of the output can be changed in the REFORM menu. For example, to print ALL amino acids at every position, the REFORM menu would be set as follows:

10 20 30 40 50 60 70
Maxxxkxxa xxxlxmxLxxatxxx xxxxxCxx xsxxfkglcxsxxxCxx
AF112443_Cmarsiyfma flvlamtlfvaygvq gkeiccke ltkpvk cssdplcqk
AF128239_Cmarsiyfma flvlavtlfvangvq gqnnickt tskhfkglcfadskcrk
BOAJ5280_Cmkntvklsligfvmltvlllgetvia qkrkpcys qepd ktcevn rcka
BOAJ5281_Cmkntvklsligfvmltvlllgetvia qkrkpcys qepd ktcevn rcka
CAGTHIOGN_magfskvia tiflmmmlvfatgmv aeartces qshrfkglcfsksncgs
CAGTHIOGN_magfskvia tiflmmmlvfatdmm aeakicea lsgnfkglclssrdcgn
CAGT_CDS1 magfskvva tiflmmllvfatdmm aeakicea lsgnfkglclssrdcgn
GMU12150_Cmsrsvplvs ticvlllllvatemmgptmvaeartces qshrfkgpclsdtncgs
ZMA133530_mr ivymaav mclvlatmss tspsfcqaggcigcprappppsdetcyedlkcsasrchl
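The dot notation described for REFORM can be imitated with a short Python sketch. It is not the REFORM program: here the consensus is simply taken to be the most common non-gap residue in each column, which is an assumption, and the function names and toy sequences are invented.

```python
from collections import Counter

def consensus(seqs, gap="-"):
    """Most common non-gap residue per column (gap if the column is empty)."""
    cons = []
    for column in zip(*seqs):
        residues = [c for c in column if c != gap]
        cons.append(Counter(residues).most_common(1)[0][0] if residues else gap)
    return "".join(cons)

def dot_display(seqs, gap="-"):
    """Replace residues that match the consensus with dots."""
    cons = consensus(seqs, gap)
    rows = ["".join("." if c == k and c != gap else c
                    for c, k in zip(s, cons))
            for s in seqs]
    return cons, rows

# Toy alignment, not the defensin data.
cons, rows = dot_display(["MAKC", "MSKC", "MA-C"])
print(cons)
for row in rows:
    print(row)
```

Dots make the positions that differ from the consensus stand out, which is the main reason for using this style of display.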


b) JALVIEW - Graphic display and alignment

Jalview is a feature-rich sequence alignment viewer. It can be launched by selecting an alignment in blnalign or blpalign, and choosing 'Alignment --> Jalview'.  (Note: since Jalview also performs multiple alignments, bldna and blprotein also have options to launch Jalview).

The alignment above is shown using one of several color schemes available. The Hydrophobicity color scheme shows hydrophobic residues in red, hydrophilic residues in blue, and residues of intermediate hydrophobicity in varying shades of purple.
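A red-to-blue hydrophobicity gradient of this kind can be sketched in Python as follows. The sketch assumes the Kyte-Doolittle hydropathy scale and a simple linear blend between blue and red. The exact colors Jalview produces may differ, and the function name is invented for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobicity_color(residue):
    """Return (red, green, blue): red = hydrophobic, blue = hydrophilic."""
    value = KYTE_DOOLITTLE[residue.upper()]
    # Rescale from the range -4.5 .. 4.5 to 0.0 .. 1.0.
    fraction = (value + 4.5) / 9.0
    red = round(255 * fraction)
    return (red, 0, 255 - red)

for aa in "IRG":
    print(aa, hydrophobicity_color(aa))
```

Residues of intermediate hydrophobicity get a mixture of red and blue, which is what appears as shades of purple in the display.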

The alignment can be written in paginated form, suitable for printing, by saving in a variety of formats.

Jalview can also do complete alignments from unaligned sequences. For a full description of the capabilities of Jalview, see