This is the on-line help file for Clustal X, using the NCBI Vibrant Toolkit.   

It should be named or defined as: clustalx_help 
except with MSDOS in which case it should be named Clustal X.HLP

For full details of usage and algorithms, please read the CLUSTALW.DOC file.


Toby  Gibson
Des   Higgins
Julie Thompson

EMBL, Heidelberg, Germany.     May 1994.


>>HELP G <<
                      General help for CLUSTAL X 

Clustal X is a general purpose multiple alignment program for DNA or proteins,
using a window interface for sequence input and display.

SEQUENCE INPUT:  sequences (and profiles) are input using the FILE menu.
Invalid options will be disabled. All sequences must be in 1 file, one after
another. 6 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT, 
Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup) and GDE flat file.
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in GCG/MSF).  

Clustal X has two modes which can be selected using the switch directly above
the sequence display: MULTIPLE ALIGNMENT MODE and PROFILE ALIGNMENT MODE.

To do a MULTIPLE ALIGNMENT on a set of sequences, make sure MULTIPLE ALIGNMENT
MODE is selected. A single sequence data area is then displayed. The 
ALIGNMENT menu then allows you to either produce a guide tree for the alignment,
or to do a multiple alignment following the guide tree, or to do a full multiple
alignment.

In PROFILE ALIGNMENT MODE, two sequence data areas are displayed, allowing you
to align 2 alignments (or profiles). Profiles are also used to add a new
sequence to an old alignment, or to use secondary structure to guide the
alignment process.  GAPS in the old alignments are indicated using the "-" 
character.   PROFILES can be input in ANY of the allowed formats; just 
use "-" (or "." for MSF) for each gap position.

PHYLOGENETIC TREES can be calculated from old alignments (read in with "-"
characters to indicate gaps) OR after a multiple alignment while the alignment
is still displayed.

The alignment is displayed on the screen with the sequence names on the left
hand side. The sequence alignment is for display only; it cannot be edited here
(except for changing the sequence order by cutting-and-pasting on the
sequence names). 

A ruler is displayed below the sequences, starting at 1 for the first residue
position (residue numbers in the sequence input file are ignored).

The line above the ruler is used to mark strongly conserved positions. Three
characters ('*', ':' and '.') are used:
'*' indicates positions which have a single, fully conserved residue
':' indicates that one of the following 'strong' groups is fully conserved:-
                 STA  
                 NEQK  
                 NHQK  
                 NDEQ  
                 QHRK  
                 MILV  
                 MILF  
                 HY  
                 FYW  

'.' indicates that one of the following 'weaker' groups is fully conserved:-
                 CSA  
                 ATV  
                 SAG  
                 STNK  
                 STPA  
                 SGND  
                 SNDEQK  
                 NDEQHK  
                 NEQHRK  
                 FVLIM  
                 HFY  

These are all the positively scoring groups that occur in the Gonnet Pam250
matrix. Strong groups are those with a score >0.5 and weak groups those with a
score <=0.5.
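
The marking rule above can be sketched in a few lines of Python. This is only
an illustration, not the Clustal X source: the function names are made up, and
the treatment of columns containing a gap (never marked) is an assumption.

```python
# Groups copied from the lists above.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (a string)."""
    residues = set(column.upper())
    # Assumption: empty columns and columns with a gap are never marked.
    if not residues or "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def conservation_line(aligned_sequences):
    """Build the marker line for equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col))
                   for col in zip(*aligned_sequences))
```

For example, conservation_line(["MKT", "MRS", "MKA"]) marks the first column
as fully conserved and the other two as belonging to a strong group.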

For profile alignments, secondary structure and gap penalty masks are displayed
above the sequences, if any data is found in the profile input file.

>>HELP F <<
                      Input / Output Files 

LOAD SEQUENCES reads sequences from one of 6 file formats, replacing any
sequences that are already loaded. All sequences must be in 1 file, one after
another. The formats that are automatically recognised are: NBRF/PIR,
EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup) and GDE
flat file.  All non-alphabetic characters (spaces, digits, punctuation marks)
are ignored except "-" which is used to indicate a GAP ("." in GCG/MSF).

The program tries to automatically recognise the different file formats used
and to guess whether the sequences are amino acid or nucleotide.  This is not
always foolproof.

FASTA and NBRF/PIR formats are recognised by having a ">" as the first 
character in the file.  

EMBL/Swiss Prot formats are recognised by the letters
ID at the start of the file (the token for the entry name field).  

CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

GCG/MSF format is recognised by the word PileUp at the start of the file.  If
your msf files do not contain this word first, edit it in at the start
of the first line.  
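
The recognition rules above amount to a test on the first characters of the
file. The Python sketch below restates them; it is not the program's own code.
The function name is hypothetical, skipping leading white space is an
assumption, and the GDE test is omitted because it is not described here.

```python
def guess_format(text):
    """Guess the sequence file format from the start of the file."""
    head = text.lstrip()  # assumption: leading white space is skipped
    if head.startswith(">"):
        # The ">" rule alone cannot tell these two formats apart.
        return "FASTA or NBRF/PIR"
    if head.startswith("ID"):
        return "EMBL/SWISSPROT"
    if head.startswith("CLUSTAL"):
        return "CLUSTAL"
    if head.startswith("PileUp"):
        return "GCG/MSF"
    return "unknown"
```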

If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
sequence will be assumed to be nucleotide.  This works in 97.3% of cases
but watch out!
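
A minimal Python sketch of the 85% rule, for illustration only. The function
name is hypothetical, and counting only alphabetic characters (so that gaps
and digits do not affect the ratio) is an assumption.

```python
def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A,C,G,T,U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    nucleotide_like = sum(c in "ACGTUN" for c in letters)
    return nucleotide_like / len(letters) >= threshold
```

A short protein that happens to be rich in A, C, G and T would be misread as
DNA by this test, which is why the guess should be checked.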


LOAD PROFILE 1 reads sequences in the same 6 file formats, replacing any
sequences already loaded as Profile 1. This option will also remove any
sequences which are loaded in Profile 2.

LOAD PROFILE 2 reads sequences in the same 6 file formats, replacing any
sequences already loaded as Profile 2.

APPEND SEQUENCES is only valid in MULTIPLE ALIGNMENT mode. The input sequences
do not replace those already loaded, but are appended at the end of the
alignment.

SAVE SEQUENCES AS... offers the user a choice of one of five output formats:
CLUSTAL, NBRF/PIR, GCG/MSF, PHYLIP or GDE. All sequences are written to a
single file. Options are available to switch between UPPER/LOWER case for
GDE files, and to output SEQUENCE NUMBERING for CLUSTAL files.

SAVE PROFILE 1 AS... is similar to the Save Sequences option except that only
those sequences in Profile 1 will be written to the output file.

SAVE PROFILE 2 AS... is similar to the Save Sequences option except that only
those sequences in Profile 2 will be written to the output file.


WRITE SEQUENCES TO PS will write the sequence display to a postscript format
file. This will include any secondary structure / gap penalty mask 
information and the consensus and ruler lines which are displayed on the
screen. The Alignment Quality curve can be optionally included in the output
file.

WRITE PROFILE 1 TO PS is similar to Write Sequences to PS except that only
the profile 1 display will be printed.

WRITE PROFILE 2 TO PS is similar to Write Sequences to PS except that only
the profile 2 display will be printed.


POSTSCRIPT PARAMETERS

A number of options are available to allow you to configure your postscript
output file.

PS COLORS FILE:

The exact RGB values required to reproduce the colors used in the alignment
window will vary from printer to printer. A PS colors file can be specified
that contains the RGB values for all the colors required by each of your
postscript printers.

By default, Clustal X looks for a file called 'colprint.par' in the current
directory (if you are running under UNIX, it then looks in your home
directory, and finally in the directories in your PATH environment variable).
If no PS colors file is found, or a color used on the screen is not defined in
it, the screen RGB values (from the Color Parameter File) are used.

The PS colors file consists of one line for each color to be defined, with the
color name followed by the RGB values (on a scale of 0 to 1). For example,

RED          0.9 0.1 0.1

Blank lines and comments (lines beginning with a '#' character) are ignored.
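
To make the file layout concrete, here is a small Python reader for it. It is
a sketch of the format as described above, not the parser used by Clustal X;
the function name and the error handling are assumptions.

```python
def parse_ps_colors(text):
    """Read 'NAME r g b' lines into a dict of name -> (r, g, b)."""
    colors = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        name, r, g, b = line.split()
        rgb = (float(r), float(g), float(b))
        if not all(0.0 <= value <= 1.0 for value in rgb):
            raise ValueError("RGB values must lie between 0 and 1: " + line)
        colors[name] = rgb
    return colors
```

Reading the example line above gives {"RED": (0.9, 0.1, 0.1)}.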


PAGE SIZE:  The alignment can be displayed on either A4 or A3 pages.

ORIENTATION: The alignment can be displayed on either a landscape or portrait
page.

PRINT HEADER: An optional header including the postscript filename, and
creation date can be printed at the top of each page.

PRINT QUALITY CURVE: The Alignment Quality curve which is displayed underneath
the alignment on the screen can be included in the postscript output.

RESIZE TO FIT PAGE: By default, the alignment is scaled to fit the page size
selected. This option can be turned off, in which case a font size of 10 will
be used for the sequences.

PRINT FROM/TO RESIDUE: A range of the alignment can be printed.  The default
is to print the full alignment. The first and last residues to be printed
are specified here.

USE BLOCK LENGTH: The alignment can be divided into blocks of residues. The
number of residues in a block is specified here. More than one block may then
be printed on a single page. This is useful for long alignments of a small
number of sequences. If the block length is set to 0, the alignment will not
be divided into blocks, but printed across a number of pages.

>>HELP E <<
                          Editing Alignments

Clustal X allows you to change the order of the sequences in the alignment, by
cutting-and-pasting the sequence names.

To select a group of sequences to be moved, click on a sequence name and drag
the cursor until all the required sequences are highlighted. Holding down the
Shift key when clicking on the first name will add new sequences to those
already selected.

The selected sequences can be removed from the alignment by using the EDIT menu,
CUT option.

To add the cut sequences back into an alignment, select a sequence by clicking
on the sequence name. The cut sequences will be added to the alignment,
immediately following the selected sequence, by the EDIT menu, PASTE option.

To add the cut sequences to an empty alignment (e.g. when cutting sequences from
Profile 1 and pasting them to Profile 2), click on the empty sequence name
display area, and select the EDIT menu, PASTE option as before.

The sequence selection and sequence range selection can be cleared using the
EDIT menu, CLEAR SEQUENCE SELECTION and CLEAR RANGE SELECTION options
respectively.

>>HELP M <<
                          Multiple Alignments

Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do multiple
alignments.

Multiple alignments are carried out in 3 stages:
 
1) all sequences are compared to each other (pairwise alignments);
 
2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file).
 
3) the final multiple alignment is carried out, using the dendrogram as a guide.

The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option.
You can skip the first 2 stages (pairwise alignments; guide tree) by using an
old guide tree file (DO ALIGNMENT FROM TREE); or you can just produce the guide
tree with no final multiple alignment (the guide tree option in the ALIGNMENT
menu).


REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the
alignment. Sequences can be selected by clicking on the sequence names - see
Editing Alignments for more details. The unselected sequences are then 'fixed'
and a profile is made including only the unselected sequences. Each of the
selected sequences in turn is then realigned to this profile. The realigned
sequences will be displayed as a group at the end of the alignment.


REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the 
alignment. A residue range can be selected by clicking on the sequence display
area. A multiple alignment is then performed, following the 3 stages described
above, but only using the selected residue range. Finally the new alignment
of the range is pasted back into the full sequence alignment.


RESET GAPS BETWEEN ALIGNMENTS will remove any new gaps introduced into the
sequences during multiple alignment if you wish to change the parameters and
try again.  This only takes effect just before you do a second multiple
alignment.  You can make phylogenetic trees after alignment whether or not this
is ON.  If you turn this OFF, the new gaps are kept even if you do a second
multiple alignment. This allows you to iterate the alignment gradually.
Sometimes, the alignment is improved by a second or third pass.


SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.


ALIGNMENT PARAMETERS displays a sub-menu with the following options:

Pairwise Alignment parameters control the speed/sensitivity of the initial
alignments.

Multiple Alignment parameters control the gaps in the final multiple
alignments.

Protein Gap Parameters displays a temporary window which allows you to set
various parameters only used in the alignment of protein sequences.


OUTPUT FORMAT OPTIONS allows you to choose from 5 different alignment formats
(CLUSTAL, GCG, NBRF/PIR, PHYLIP and GDE).  


ALIGNMENT PARAMETERS
--------------------


PAIRWISE ALIGNMENT PARAMETERS

A distance is calculated between every pair of sequences and these are
used to construct the phylogenetic tree which guides the final multiple
alignment. The scores are calculated from separate pairwise alignments.  These
can be calculated using 2 methods: dynamic programming (slow but accurate) or
by the method of Wilbur and Lipman (extremely fast but approximate).   

You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS
option.  The slow/accurate method is fine for short sequences but will be
VERY SLOW for many (e.g. >20) long (e.g. >1000 residue) sequences.   


SLOW/ACCURATE alignment parameters:

These parameters do not have any effect on the speed of the alignments.  They
are used to give initial alignments which are then rescored to give percent
identity scores.  These % scores are the ones which are displayed on the 
screen.  The scores are converted to distances for the trees.
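
The rescoring step can be pictured with the Python sketch below. It is an
illustration under stated assumptions, not the program's own arithmetic:
positions where either sequence has a gap are skipped, and the distance is
taken to be 1 minus the fractional identity. Both function names are
hypothetical; see the documentation for the exact definitions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: gapped positions are not counted
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(percent):
    """Assumed conversion: distance = 1 - identity / 100."""
    return 1.0 - percent / 100.0
```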

Gap Open Penalty:      the penalty for opening a gap in the alignment.

Gap extension penalty: the penalty for extending a gap by 1 residue.

Protein weight matrix: the scoring table which describes the similarity of 
each amino acid to each other.  For DNA, a hard-coded matrix is used. See
the Multiple alignment parameters, MATRIX option below for more details.


FAST/APPROXIMATE alignment parameters:

These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters.   2 techniques are used to
make these alignments very fast: 1) only exactly matching fragments (k-tuples)
are considered; 2) only the 'best' diagonals (the ones with most k-tuple
matches) are used.


K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. 
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may need to increase the default.


GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has
little effect on the speed or sensitivity except for extreme values.


TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated.  Only the best ones (with most matches) are
used in the alignment.  This parameter specifies how many.  Decrease for speed;
increase for sensitivity.


WINDOW SIZE:  This is the number of diagonals around each of the 'best' 
diagonals that will be used.  Decrease for speed; increase for sensitivity.
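
The roles of K-TUPLE SIZE and TOP DIAGONALS can be illustrated with the
Python sketch below. It only counts exact k-tuple matches per diagonal and
picks the best ones; it is not the Wilbur and Lipman implementation used by
the program, and the window and gap penalty steps are left out. The function
names and the default values are hypothetical.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq1, seq2, k=2):
    """Count exact k-tuple matches on each diagonal (i - j) of a dot plot."""
    positions = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        positions[seq2[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq1) - k + 1):
        for j in positions.get(seq1[i:i + k], ()):
            counts[i - j] += 1
    return counts


def top_diagonals(counts, n_top=5):
    """Return the n_top diagonals with the most k-tuple matches."""
    return [diagonal for diagonal, _ in counts.most_common(n_top)]
```

A larger k gives fewer tuples to match and so runs faster, but short or
imperfect similarities are then missed.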


MULTIPLE ALIGNMENT PARAMETERS

These parameters control the final multiple alignment.  This is the core of
the program and the details are complicated.  To fully understand the use
of the parameters and the scoring system, you will have to refer to the
documentation.

Each step in the final multiple alignment consists of aligning two alignments 
or sequences.  This is done progressively, following the branching order in 
the GUIDE TREE.  The basic parameters to control this are two gap penalties and
the scores for various identical/non-identical residues.  

The GAP OPENING AND EXTENSION PENALTIES can be set here.  These control the 
cost of opening up every new gap and the cost of every item in a gap.  
Increasing the gap opening penalty will make gaps less frequent.  Increasing 
the gap extension penalty will make gaps shorter.   Terminal gaps are not 
penalised.
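
As a rough illustration of how the two penalties combine, the Python sketch
below charges the opening penalty once and the extension penalty for every
position in the gap. This is a simplified reading of the paragraph above; the
program also adjusts penalties by position, as described in the
documentation. The function name is hypothetical.

```python
def gap_cost(length, gap_open, gap_extend, terminal=False):
    """Cost of one gap of `length` positions under the two basic penalties."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    # Terminal gaps are not penalised; a gap of length 0 is no gap at all.
    if terminal or length == 0:
        return 0.0
    return gap_open + gap_extend * length
```

With an opening penalty of 10 and an extension penalty of 0.5, one gap of 3
positions costs 11.5, while three separate gaps of 1 position cost 31.5.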

The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
distantly related sequences until after the most closely related sequences have 
been aligned.   The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level
to any other sequences will be aligned later.

For DNA, the scoring system assigns a score of 1 for two identical bases
and zero otherwise.   The TRANSITION WEIGHT gives transitions (A <--> G
or C <--> T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a score
between 0 and 1; a score of zero means that the transitions are scored as
mismatches. For distantly related DNA sequences, the weight should be near to
zero; for closely related sequences it can be useful to assign a higher score.
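
The DNA scoring rule in the paragraph above can be written out as follows.
This Python sketch covers only the four bases A, C, G and T; ambiguity codes
and U are not handled, and the function name is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, transition_weight):
    """Score two bases: 1 for identity, the weight for a transition, else 0."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must lie between 0 and 1")
    x, y = x.upper(), y.upper()
    if x == y:
        return 1.0
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition_weight  # A <--> G or C <--> T
    return 0.0  # transversion
```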


The MATRIX option allows you to choose a series of weight matrices. For protein
alignments, you use a weight matrix to determine the similarity of non-identical
amino acids.  For example, Tyr aligned with Phe is usually judged to be 'better'
than Tyr aligned with Pro.   These are not used with DNA.

There are three 'in-built' series of weight matrices offered.  Each consists
of several matrices which work differently at different evolutionary distances.
To see the exact details, read the documentation.  Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from
almost identical sequences to highly divergent ones).   For very similar
sequences, it is best to use a strict weight matrix which only gives a high
score to identities and the most favoured conservative substitutions.  For
more divergent sequences, it is appropriate to use "softer" matrices which
give a high score to many other frequent substitutions.

1) BLOSUM (Henikoff).   These matrices appear to be the best available for 
carrying out database similarity (homology) searches.  The matrices used are:
Blosum80, 62, 40 and 30.

2) PAM (Dayhoff).  These have been extremely widely used since the late '70s.
We use the PAM 120, 160, 250 and 350 matrices.

3) GONNET.  These matrices were derived using almost the same
procedure as the Dayhoff one (above) but are much more up to date and are based
on a far larger data set.  They appear to be more sensitive than the Dayhoff
series.  We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

We also supply an identity matrix which gives a score of 10 to two identical 
amino acids and a score of zero otherwise.  This matrix is not very useful.
Alternatively, you can read in your own (just one matrix, not a series).

A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The scores in the new weight matrix should be
similarities. You can use negative as well as positive values if you wish,
although the matrix will be automatically adjusted to all positive scores,
unless the NEGATIVE MATRIX option is selected.

INPUT FORMAT  The format used for a new matrix is the same as the BLAST program.
Any lines beginning with a # character are assumed to be comments. The first
non-comment line should contain a list of amino acids in any order, using the
1 letter code, followed by a * character. This should be followed by a square
matrix of scores, with one row and one column for each amino acid. The last
row and column of the matrix (corresponding to the * character) contain the
minimum score over the whole matrix.
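
The layout described above can be read with a few lines of Python. This is a
sketch of the file format only, not the reader built into the program. The
function name is hypothetical, and accepting an optional residue label at the
start of each score row (as found in BLAST matrix files) is an assumption.

```python
def parse_blast_matrix(text):
    """Return {(row_symbol, col_symbol): score} from a BLAST-style matrix."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    symbols = rows[0]  # 1-letter codes followed by '*'
    if len(rows) - 1 != len(symbols):
        raise ValueError("matrix must be square")
    matrix = {}
    for row_symbol, row in zip(symbols, rows[1:]):
        # Assumption: a score row may begin with its own residue label.
        values = row[1:] if len(row) == len(symbols) + 1 else row
        if len(values) != len(symbols):
            raise ValueError("wrong number of scores in row " + row_symbol)
        for col_symbol, value in zip(symbols, values):
            matrix[(row_symbol, col_symbol)] = float(value)
    return matrix
```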

For DNA alignments, a single hard-coded matrix is used. This is the default
scoring matrix used by BESTFIT for the comparison of nucleic acid sequences.
X's and N's are treated as matches to any IUB ambiguity symbol.  All matches
score 1.0; all mismatches for IUB symbols score -0.9.
 

PROTEIN GAP PARAMETERS
----------------------

RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
or increase the gap opening penalties at each position in the alignment or
sequence.  See the documentation for details.  As an example, positions that 
are rich in glycine are more likely to have an adjacent gap than positions that
are rich in valine.

HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
a run (5 or more residues) of hydrophilic amino acids; these are likely to
be loop or random coil regions where gaps are more common.  The residues that 
are "considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.

GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
to each other.  Gaps that are less than this distance apart are penalised more
than other gaps.  This does not prevent close gaps; it makes them less frequent,
promoting a block-like appearance of the alignment.

END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).  If
you turn this off, end gaps will be ignored for this purpose.  This is useful
when you wish to align fragments where the end gaps are not biologically
meaningful.


OUTPUT FORMAT OPTIONS

Five output formats are offered.  You can choose more than one (or all 5 if
you wish).  

CLUSTAL format output is a self explanatory alignment format.  It shows the
sequences aligned in blocks.  It can be read in again at a later date to
(for example) calculate a phylogenetic tree or add a new sequence with a 
profile alignment.

GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe 
Felsenstein.  This is an extremely widely used package for doing every 
imaginable form of phylogenetic analysis (MUCH more than the modest
introduction offered by this program).

NBRF/PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap
characters "-" are used to indicate the positions of gaps in the multiple 
alignment.   These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.  

GDE:  this format is used by the GDE package of Steven Smith.


OUTPUT ORDER is used to control the order of the sequences in the output
alignments.  By default, it is the same as the input order.  This switch can
be used to make the order correspond to the order in which the sequences
were aligned (from the guide tree/dendrogram), thus automatically grouping 
closely related sequences.


>>HELP P <<
                   Profile and Structure Alignments
   
By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile 
alignments allow you to store alignments of your favourite sequences 
and add new sequences to them in small bunches at a time.  A profile 
is simply an alignment of one or more sequences (e.g. an alignment output 
file from Clustal X). Each input can be a single sequence.  One or both sets 
of input sequences may include secondary structure assignments or gap 
penalty masks to guide the alignment. 

Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do profile and
secondary structure alignments.

The profiles can be in any of the allowed input formats with "-" characters
used to specify gaps (except for GCG/MSF where "." is used).

You have to load the 2 profiles by choosing FILE, LOAD PROFILE 1 and
LOAD PROFILE 2.  Then ALIGNMENT, ALIGN PROFILE 2 to PROFILE 1 will align the
2 profiles to each other. Secondary structure masks in either profile can be
used to guide the alignment. This option compares all the sequences in
profile 1 with all the sequences in profile 2 in order to build a guide tree
which will be used to calculate sequence weights, and select appropriate
alignment parameters for the final profile alignment.

You can skip the first stages (pairwise alignments; guide tree) by using an
old guide tree file (ALIGN PROFILES FROM TREE). 

The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second
profile and align them to the first profile, 1 at a time.  This is useful to
add some new sequences to an existing alignment, or to align a set of sequences
to a known structure.  In this case, the second profile need not be pre-aligned.

RESET GAPS BETWEEN ALIGNMENTS will remove any new gaps introduced into the
profiles during alignment if you wish to change the parameters and try again.
This only takes effect just before you do a second profile alignment. If you
turn this OFF, the new gaps are kept even if you do a second profile alignment.
This allows you to iterate the alignment gradually. Sometimes, the alignment is
improved by a second or third pass.

SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.

The alignment parameters can be set using the ALIGNMENT PARAMETERS menu,
Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options.
These are EXACTLY the same parameters as used by the general, automatic
multiple alignment procedure. The general multiple alignment procedure is
simply a series of profile alignments.  Carrying out a series of profile
alignments on larger and larger groups of sequences, allows you to
manually build up a complete alignment, if necessary editing intermediate
alignments.

SECONDARY STRUCTURE PARAMETERS allows you to set secondary structure options.
If a solved structure is available, it can be used to guide the alignment by
raising gap penalties within secondary structure elements, so that gaps will
preferentially be inserted into unstructured surface loop regions.
Alternatively, a user-specified gap penalty mask can be supplied for a similar
purpose.

A gap penalty mask is a series of numbers between 1 and 9, one per position in 
the alignment. Each number specifies how much the gap opening penalty is to be 
raised at that position (raised by multiplying the basic gap opening penalty
by the number) i.e. a mask figure of 1 at a position means no change
in gap opening penalty; a figure of 4 means that the gap opening penalty is
four times greater at that position, making gaps 4 times harder to open.
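
The multiplication described above can be written as a short Python function.
It is an illustration only; the function name is hypothetical and the basic
gap opening penalty is whatever value has been set in the alignment
parameters.

```python
def masked_gap_open_penalties(mask, gap_open):
    """Multiply the basic gap opening penalty by each mask digit (1-9)."""
    penalties = []
    for digit in mask:
        if digit not in "123456789":
            raise ValueError("mask figures must be between 1 and 9")
        penalties.append(int(digit) * gap_open)
    return penalties
```

With a basic penalty of 10, the mask "1147" gives penalties of 10, 10, 40 and
70 at the four positions.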

The format for gap penalty masks and secondary structure masks is explained
in the SECONDARY STRUCTURE / GAP PENALTY MASKS section below.


SECONDARY STRUCTURE / GAP PENALTY MASKS
---------------------------------------

The use of secondary structure-based penalties has been shown to improve 
the accuracy of multiple alignment. Therefore Clustal X now allows gap penalty 
masks to be supplied with the input sequences. The masks work by raising gap 
penalties in specified regions (typically secondary structure elements) so that
gaps are preferentially opened in the less well conserved regions (typically 
surface loops).

The USE PROFILE 1/2 SECONDARY STRUCTURE / GAP PENALTY MASK options control
whether the input secondary structure information or gap penalty masks will be
used during the profile alignment.

The OUTPUT options control whether the secondary structure and gap penalty masks
should be included in the Clustal X output alignments. Showing both is useful for
understanding how the masks work. The secondary structure information is itself
useful in judging the alignment quality and in seeing how residue conservation
patterns vary with secondary structure. 

The HELIX and STRAND GAP PENALTY options provide the value for raising the gap
penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL
format, capital residues denote the A and B core structure notation.  Basic gap
penalties are multiplied by the amount specified.

The LOOP GAP PENALTY option provides the value for the gap penalty in Loops.
By default this penalty is not raised. In CLUSTAL format, loops are specified
by "." in the secondary structure notation.

The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap
penalty at the ends of secondary structures. Ends of secondary structures are
observed to grow and/or shrink in related structures. Therefore by default these
are given intermediate values, lower than the core penalties. All secondary
structure read in as lower case in CLUSTAL format gets the reduced terminal
penalty.

The HELIX and STRAND TERMINAL PENALTY options specify the range of structure
termini for the intermediate penalties. In the alignment output, these are
indicated as lower case. For Alpha Helices, by default, the range spans the end
helical turn. For Beta Strands, the default range spans the end residue and the
adjacent loop residue, since sequence conservation often extends beyond the
actual H-bonded Beta Strand.

Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input files.
For many 3-D protein structures, secondary structure information is recorded in
the feature tables of SWISS-PROT database entries. You should always check that
the assignments are correct - some are quite inaccurate. Clustal X looks for
SWISS-PROT HELIX and STRAND assignments e.g.


FT   HELIX       100    115
FT   STRAND      118    119


The structure and penalty masks can also be read from CLUSTAL alignment format 
as comment lines beginning "!SS_" or "!GM_" e.g.

!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    113337777777777333133377777777773333337333111111111333777777
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

Note that the mask itself is a set of numbers between 1 and 9 each of which is 
assigned to the residue(s) in the same column below. 
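
The relation between the two mask lines in the example can be sketched in
Python as below. The figures 7, 3 and 1 used in the usage note are the ones
that appear in the example above; the function name is hypothetical, and the
penalty values actually applied are those set in the SECONDARY STRUCTURE
PARAMETERS window.

```python
def structure_to_gap_mask(ss_mask, helix, strand, terminal, loop):
    """Turn a secondary structure mask into a gap penalty mask string."""
    for value in (helix, strand, terminal, loop):
        if not 1 <= value <= 9:
            raise ValueError("mask figures must be between 1 and 9")
    digits = []
    for symbol in ss_mask:
        if symbol == "A":          # core alpha helix
            value = helix
        elif symbol == "B":        # core beta strand
            value = strand
        elif symbol in "ab":       # termini of helices and strands
            value = terminal
        elif symbol == ".":        # loop
            value = loop
        else:
            raise ValueError("unexpected structure symbol: " + symbol)
        digits.append(str(value))
    return "".join(digits)
```

For instance, structure_to_gap_mask("..aaAAaa.", helix=7, strand=7,
terminal=3, loop=1) returns "113377331".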

In GDE flat file format, the masks are specified as text and the names 
must begin with SS_ or GM_.

Either a structure or penalty mask or both may be used.  If both are included
in an alignment, the user will be asked which is to be used.


>>HELP T <<
                            Phylogenetic Trees

Before calculating a tree, you must have an ALIGNMENT in memory.  This can be
input using the FILE menu, LOAD SEQUENCES option, or it can be the result of a
full multiple alignment that you have just carried out and is still in memory.
Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!

The method used is the NJ (Neighbour Joining) method of Saitou and Nei.  First
you calculate distances (percent divergence) between all pairs of sequences
from a multiple alignment; second you apply the NJ method to the distance
matrix.


To calculate a tree, use the DRAW TREE option.  This gives an UNROOTED tree and
all branch lengths.  The root of the tree can only be inferred by using an
outgroup (a sequence that you are certain branches at the outside of the tree
.... certain on biological grounds) OR if you assume a degree of constancy in
the 'molecular clock', you can place the root in the 'middle' of the tree
(roughly equidistant from all tips).


BOOTSTRAP TREE uses a method for deriving confidence values for the groupings in
a tree (first adapted for trees by Joe Felsenstein).   It involves making N
random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
drawing N trees (1 from each sample) and counting how many times each grouping
from the original tree occurs in the sample trees.  You can set N using the
NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In practice,
you should use a very large number of bootstrap replicates (1000 is recommended,
even if it means running the program for an hour on a slow microcomputer; on a
workstation it will be MUCH faster).
You can also supply a seed number for the random number generator here.
Different runs with the same seed will give the same answer. See the
documentation for more details.
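
The sampling step can be sketched like this. It is a minimal illustration,
assuming that sites are drawn with replacement (the usual bootstrap) and using
Python's own random number generator, which will not reproduce the numbers
Clustal X generates for the same seed. The function name is hypothetical, and
the tree-drawing and counting steps are not shown.

```python
import random


def bootstrap_alignments(alignment, n_trials, seed):
    """Yield n_trials resampled alignments, each as long as the original."""
    rng = random.Random(seed)  # same seed -> same sequence of samples
    names = list(alignment)
    length = len(alignment[names[0]])
    for _ in range(n_trials):
        # Draw 'length' column indices with replacement.
        cols = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(alignment[name][c] for c in cols)
               for name in names}


aln = {"a": "ACGT", "b": "ACGA"}
samples = list(bootstrap_alignments(aln, n_trials=3, seed=111))
```

Whole columns are sampled, so the residues of different sequences at one site
always stay together.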


EXCLUDE POSITIONS WITH GAPS?  With this option, any alignment positions
where ANY of the sequences has a gap will be ignored.  This means that 'like' 
will be compared to 'like' in all distances.  It also automatically throws
away the most ambiguous parts of the alignment, which are (usually)
concentrated around gaps.  The disadvantage is that you may throw away much of
the data if there are many gaps.
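
As an illustration only (the function name is hypothetical and "-" is assumed
to be the gap character), removing every position where any sequence has a gap
can be written as:

```python
def toss_gap_columns(alignment, gap="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if gap not in col]
    return {name: "".join(col[i] for col in kept)
            for i, name in enumerate(names)}


aln = {"s1": "AC-GT", "s2": "ACTG-", "s3": "ACTGT"}
print(toss_gap_columns(aln))  # columns 3 and 5 are dropped for all sequences
```

Two of the five positions are lost here, which shows how quickly data can
disappear when gaps are frequent.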


CORRECT FOR MULTIPLE SUBSTITUTIONS?  For small divergence (say <10%) this
option makes almost no difference.  For greater divergence, this option
corrects for the fact that observed distances underestimate actual evolutionary
distances.  This is because, as sequences diverge, more than one substitution
will happen at many sites.  However, you only see one difference when you look
at the present day sequences.  Therefore, this option has the effect of
stretching branch lengths in trees (especially long branches).  The corrections
used here (for DNA or proteins) are both due to Motoo Kimura.  See the
documentation for details.
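
The effect of such a correction can be seen with the commonly published forms
of Kimura's formulas. The sketch below assumes those textbook forms (for
proteins K = -ln(1 - D - D*D/5); for DNA the two-parameter model with
transition fraction P and transversion fraction Q); the exact variants used by
Clustal X should be checked in the documentation. The function names are
hypothetical.

```python
import math


def kimura_protein(d):
    """Corrected distance from the observed fraction of differing sites d."""
    arg = 1.0 - d - 0.2 * d * d
    if arg <= 0.0:
        raise ValueError("sequences too divergent to correct")
    return -math.log(arg)


def kimura_dna(p, q):
    """Two-parameter correction: p = transitions, q = transversions (fractions)."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent to correct")
    return -0.5 * math.log(a * math.sqrt(b))


for observed in (0.05, 0.30, 0.60):
    print(observed, round(kimura_protein(observed), 3))
```

At 5% observed divergence the corrected value is barely changed, while at 60%
it is substantially larger, which is the stretching of long branches described
above.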

For VERY divergent sequences, the distances cannot be reliably
corrected.  You will be warned if this happens.  Even if none of the distances
in a data set exceeds the reliable threshold, if you bootstrap the data, 
some of the bootstrap distances may randomly exceed the safe limit.  


SAVE LOG FILE will write the tree calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.


OUTPUT FORMAT OPTIONS:  three different formats are allowed.  None of these
displays the tree visually.  You must make the tree yourself (on paper)
using the results OR get the PHYLIP package and use the tree drawing facilities
there.  (Get the PHYLIP package anyway if you are interested in trees).
 

TREE OUTPUT FORMAT OPTIONS
--------------------------

Three output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances.

None of these formats displays the results graphically.  To see a graphic
representation, get the PHYLIP package and use format 2) below.  It can be
imported into the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM and displayed 
graphically.

1) Clustal format output.  
This format is verbose and lists all of the distances between the sequences
and the number of alignment positions used for each.   The tree is described
at the end of the file.  It lists the sequences that are joined at each 
alignment step and the branch lengths.  After two sequences are joined, the
resulting group is referred to later as a NODE.  The number of a NODE is the
number of the lowest sequence in that NODE.

2) Phylip format output.
This format is the New Hampshire format, used by many phylogenetic analysis
packages.  It consists of a series of nested parentheses, describing the
branching order, with the sequence names and branch lengths.  It can
be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP
package to see the trees graphically.  This is the same format used during
multiple alignment for the guide trees.
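
To show what the nested parentheses look like, the sketch below writes a small
tree in this format. The tree, the sequence names, the branch lengths and the
number of decimal places are invented for the example; only the general layout
(parentheses, names, ":length", closing ";") is the point.

```python
def to_newick(node):
    """node is (name, length) for a sequence or ([children], length) for a group."""
    label, length = node
    if isinstance(label, list):
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    else:
        text = label
    if length is not None:
        text += ":%.5f" % length
    return text


# An unrooted tree has three branches meeting at the outermost level.
tree = ([("seqA", 0.1), ("seqB", 0.2),
         ([("seqC", 0.05), ("seqD", 0.07)], 0.3)], None)
newick = to_newick(tree) + ";"
print(newick)
# (seqA:0.10000,seqB:0.20000,(seqC:0.05000,seqD:0.07000):0.30000);
```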

3) The distances only.
This format just outputs a matrix of all the pairwise distances in a format
that can be used by the Phylip package.  It used to be useful when one
could not produce distances from protein sequences in the Phylip package but
is now redundant (Protdist of Phylip 3.5 now does this).


>>HELP C <<
                               Colors

Clustal X provides a versatile coloring scheme for the sequence alignment 
display. The sequences (or profiles) are colored automatically when they are
loaded. Sequences can be colored either by assigning a color to specific
residues, or on the basis of an alignment consensus. In the latter case,
the alignment consensus is calculated automatically, and the residues in each
column are colored according to the consensus character assigned to that column.
In this way, you can choose to highlight, for example, conserved hydrophilic or
hydrophobic positions in the alignment.

The 'rules' used to color the alignment are specified in a COLOR PARAMETER
FILE. Clustal X automatically looks for a file called 'colprot.par' for protein
sequences or 'coldna.par' for DNA, in the current directory
(if you are running under UNIX, it then looks in your home directory, and
finally in the directories in your PATH environment variable).

By default, if no color parameter file is found, protein sequences are colored
by residue as follows:

	Color			Residue Code

	ORANGE			GPST
	RED			HKR
	BLUE			FWY
	GREEN			ILMV

In the case of DNA sequences, the default colors are as follows:

	Color			Residue Code

	ORANGE			A
	RED			C
	BLUE			T
	GREEN			G


The default coloring system is to show residues as a colored character on a
white background. The BACKGROUND COLORING option shows the sequence residues 
using a black character on a colored background.

The DEFAULT COLOR PARAMETERS option looks first for the color parameter file
(colprot.par or coldna.par, as described above) and, if no file is found, uses
the default residue-specific colors.

You can specify your own coloring scheme by using the LOAD COLOR PARAMETER FILE
option. The format of the color parameter file is described below.

COLOR PARAMETER FILE

This file is divided into 3 sections:

1) the names and rgb values of the colors
2) the rules for calculating the consensus
3) the rules for assigning colors to the residues
 
An example file is given here.
--------------------------------------------------------------------
@rgbindex
RED          0.9 0.1 0.1
BLUE         0.1 0.1 0.9
GREEN        0.1 0.9 0.1
YELLOW       0.9 0.9 0.0

@consensus
% = 60% w:l:v:i:m:a:f:c:y:h:p
# = 80% w:l:v:i:m:a:f:c:y:h:p
- = 50% e:d
+ = 60% k:r
q = 50% q:e
p = 50% p
n = 50% n
t = 50% t:s

@color
g = RED
p = YELLOW
t = GREEN if t:%:#
n = GREEN if n
w = BLUE if %:#:p
k = RED if +
--------------------------------------------------------------------

The first section is optional and is identified by the header @rgbindex. If this
section exists, each color used in the file must be named and the rgb values
specified (on a scale from 0 to 1). If the rgb index section is not found, the
following set of hard-coded colors will be used.
 
RED          0.9 0.1 0.1
BLUE         0.1 0.1 0.9
GREEN        0.1 0.9 0.1
ORANGE       0.9 0.7 0.3
CYAN         0.1 0.9 0.9
PINK         0.9 0.5 0.5
MAGENTA      0.9 0.1 0.9
YELLOW       0.9 0.9 0.0


The second section is optional and is identified by the header @consensus. It
defines how the consensus is calculated.
 
The format of each consensus parameter is:-
 
c = n% residue_list
 
        where
              c             is a character used to identify the parameter.
              n             is an integer value used as the percentage cutoff
                            point.
              residue_list  is a list of residues denoted by a single
                            character, delimited by a colon (:).
 
For example:   # = 60% w:l:v:i
will assign a consensus character # to any column in the alignment in which
more than 60% of the residues are w, l, v or i.
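
A consensus parameter of this form could be read and applied as sketched
below. This is an illustration, not the Clustal X code. The function names are
hypothetical, and two points are assumptions: the percentage is taken over all
characters in the column (gaps included), and every parameter whose cutoff is
exceeded is reported.

```python
def parse_consensus_rule(line):
    """Split 'c = n% residue_list' into (c, n, set of residues)."""
    char, rest = line.split("=", 1)
    cutoff, residues = rest.split("%", 1)
    return char.strip(), int(cutoff), set(residues.strip().lower().split(":"))


def column_consensus(column, rules):
    """Return the consensus characters whose cutoff is exceeded in this column."""
    matches = []
    for char, cutoff, residues in rules:
        hits = sum(1 for residue in column.lower() if residue in residues)
        if 100.0 * hits / len(column) > cutoff:
            matches.append(char)
    return matches


rules = [parse_consensus_rule("# = 60% w:l:v:i"),
         parse_consensus_rule("- = 50% e:d")]
print(column_consensus("LLVIA", rules))  # 4 of 5 residues are in w:l:v:i
```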
        
 
The third section is identified by the header @color, and defines how colors
are assigned to each residue in the alignment.
 
The color parameters can take one of two formats:

1) r = color
2) r = color if consensus_list
 
        where
              r               is a character used to denote a residue.
              color           is one of the colors in the GDE color lookup table.
              consensus_list  is a list of consensus characters, each denoted
                              by a single character, delimited by a colon (:).
 
Examples:
1) g = ORANGE will color all glycines ORANGE, regardless of the consensus.

2) w = BLUE if w:%:#
will color BLUE any tryptophan which is found in a column with a consensus of
w, % or #.
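
The two formats can be applied as in the following sketch. It is only an
illustration: the function names are hypothetical, the consensus of a column is
passed in as a list of characters, and taking the first matching parameter is
an assumption.

```python
def parse_color_rule(line):
    """Split 'r = color' or 'r = color if consensus_list' into its parts."""
    residue, rest = line.split("=", 1)
    color, _, condition = rest.partition(" if ")
    condition = condition.strip()
    required = set(condition.split(":")) if condition else None
    return residue.strip().lower(), color.strip(), required


def residue_color(residue, consensus_chars, rules):
    """Return the color for a residue in a column, or None if no rule applies."""
    for rule_residue, color, required in rules:
        if rule_residue != residue.lower():
            continue
        # Format 1 has no condition; format 2 needs a matching consensus.
        if required is None or required & set(consensus_chars):
            return color
    return None


rules = [parse_color_rule("g = ORANGE"),
         parse_color_rule("w = BLUE if w:%:#")]
print(residue_color("G", [], rules))     # ORANGE, regardless of the consensus
print(residue_color("W", ["%"], rules))  # BLUE, because the column is %
print(residue_color("W", ["+"], rules))  # None, no listed consensus matches
```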
 

>>HELP Q <<
                       Alignment Quality

Clustal X provides an indication of the quality of an alignment by plotting
a 'conservation score' for each column of the alignment. A high score indicates
a well-conserved column; a low score indicates low conservation. The quality
curve is drawn below the sequences / profiles.


Low-Scoring Segments
--------------------

Unreliable regions in the alignment can be highlighted using the Low-Scoring
Segments option. A sequence-weighted profile is used to indicate any segments
in the sequences which score badly.  Because the profile calculation may take
some time, an option is provided to CALCULATE LOW-SCORING SEGMENTS. The 
segment display can then be toggled on or off without having to repeat the
time-consuming calculations.


MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be
hidden by increasing the minimum length of segments which will be displayed.


WEIGHT MATRIX: the scoring table which describes the similarity of each amino
acid to every other.  For DNA, a hard-coded matrix is used. The matrix is used
to calculate the sequence-weighted profile scores. A more stringent matrix,
which gives a high score only to identities and the most favoured conservative
substitutions, may be more suitable when the sequences are closely related. 
For more divergent sequences, it is appropriate to use "softer" matrices which
give a high score to many other frequent substitutions. Changing this option 
automatically recalculates the low-scoring segments.

A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The values in the new weight matrix should be
similarities and should be negative for infrequent substitutions.
 
INPUT FORMAT  The format used for a new matrix is the same as that used by the
BLAST program. Any lines beginning with a # character are assumed to be
comments. The first non-comment line should contain a list of amino acids in
any order, using the 1 letter code, followed by a * character. This should be
followed by a square matrix of scores, with one row and one column for each
amino acid. The last row and column of the matrix (corresponding to the *
character) contain the minimum score over the whole matrix.
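
A reader for a file laid out this way might look like the sketch below. It is
not the Clustal X parser. The function name is hypothetical, the toy matrix
uses an invented two-letter alphabet, and accepting an optional residue letter
at the start of each row (as BLAST matrix files have) is an assumption.

```python
def read_blast_matrix(text):
    """Return {(row_residue, column_residue): score} from matrix text."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    header, body = rows[0], rows[1:]
    if len(body) != len(header):
        raise ValueError("matrix must have one row per residue in the header")
    scores = {}
    for i, row in enumerate(body):
        label = header[i]
        if len(row) == len(header) + 1:
            label, row = row[0], row[1:]  # row starts with its residue letter
        if len(row) != len(header):
            raise ValueError("row %d has the wrong number of scores" % (i + 1))
        for column_label, value in zip(header, row):
            scores[(label, column_label)] = float(value)
    return scores


example = """# toy matrix, for illustration only
   A  R  *
A  4 -1 -1
R -1  5 -1
* -1 -1 -1
"""
matrix = read_blast_matrix(example)
print(matrix[("A", "R")])  # -1.0
```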

For DNA, a single hard-coded matrix is used. This is the default scoring
matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and
N's are treated as matches to any IUB ambiguity symbol. All matches score 1.0;
all mismatches for IUB symbols score -0.9.

HIDE LOW-SCORING SEGMENTS: The segment display can be toggled on or off. This
option does not recalculate the profile scores.


Residue Exceptions
------------------
An option is also available to highlight the residues which cause the low
scores in the quality curve. Residues which score exceptionally low are
highlighted by using a black character on a grey background if residue
coloring is selected, or using a white character on a black background if
background coloring is selected.
 
Highlighted residues are expected to occur at a moderate frequency in all the
sequences because of their steady divergence due to the natural processes of
evolution. The most divergent sequences are likely to have the most outliers.
However, the highlighted residues are especially useful in pointing to
sequence misalignments. Note that clustering of highlighted residues is a
strong indication of misalignment. This can arise due to various reasons, for
example:
 
        1. Partial or total misalignments caused by a failure in the
        alignment algorithm. Usually only in difficult alignment cases.
 
        2. Partial or total misalignments because at least one of the
        sequences in the given set is partly or completely unrelated to the
        other sequences. It is up to the user to check that the set of
        sequences is alignable.

        3. Frameshift translation errors in a protein sequence causing local
        mismatched regions to be heavily highlighted. These are surprisingly
        common in database entries. If suspected, a 3-frame translation of
        the source DNA needs to be examined.
 
Occasionally, highlighted residues may point to regions of some biological
significance.  This might happen for example if a protein alignment contains
a sequence which has acquired new functions relative to the main sequence
set. It is important to exclude other explanations, such as error or the
natural divergence of sequences, before invoking a biological explanation.


CALCULATION OF LOW-SCORING SEGMENTS
-----------------------------------

Suppose we have an alignment of m sequences of length n. Then, the alignment
can be written as:

        A11 A12 A13 .......... A1n
        A21 A22 A23 .......... A2n
        .
        .
        Am1 Am2 Am3 .......... Amn

We also have a residue comparison matrix of size R where Mij is the score for
aligning residue i with residue j.

We calculate sequence weights by building a neighbour-joining tree, in which
branch lengths are proportional to divergence. Summing the branches by
branch ownership provides the weights. See Thompson et al., CABIOS, 10, 19
(1994) and Henikoff et al., JMB, 243, 574 (1994).

To find the low-scoring segments in a sequence Si, we build a weighted profile
of the remaining sequences in the alignment. Suppose we find residue r at 
position j in the sequence; then the score for the jth position in the sequence
is defined as

	Score(Si,j) = Profile(j,r)   where Profile(j,r) is the profile score
                                       for residue r at position j in the
                                       alignment.

These residue scores are summed along the sequence in both forward and backward
directions. Segments which score negatively in both directions are considered
as 'low-scoring' and will be highlighted in the alignment display.
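
One possible reading of this summing step is sketched below. It is an
assumption, not the documented algorithm: the running sum in each direction is
reset to zero whenever it becomes positive, and a position is marked when the
running sums from both directions are negative. The function names and the
example scores are invented; min_length plays the role of the MINIMUM LENGTH
OF SEGMENTS option.

```python
def running_negative(scores):
    """Flag positions where a running sum (capped at zero) is negative."""
    total = 0.0
    flags = []
    for score in scores:
        total = min(0.0, total + score)  # assumption: reset once positive
        flags.append(total < 0.0)
    return flags


def low_scoring_segments(scores, min_length=1):
    """Return (start, end) index pairs that are negative in both directions."""
    forward = running_negative(scores)
    backward = running_negative(scores[::-1])[::-1]
    low = [f and b for f, b in zip(forward, backward)]
    segments = []
    start = None
    for i, flag in enumerate(low + [False]):  # sentinel closes a final segment
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = None
    return segments


scores = [2, 1, -3, -2, -1, 4, 3, -1, 2]
print(low_scoring_segments(scores))                # [(2, 4), (7, 7)]
print(low_scoring_segments(scores, min_length=2))  # [(2, 4)]
```

Requiring both directions trims each segment back to its negative core, so the
good residues that follow a bad stretch are not marked.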


CALCULATION OF QUALITY SCORES
-----------------------------

Suppose we have an alignment of m sequences of length n. Then, the alignment
can be written as:

        A11 A12 A13 .......... A1n
        A21 A22 A23 .......... A2n
        .
        .
        Am1 Am2 Am3 .......... Amn

We also have a residue comparison matrix of size R where Mij is the score for
aligning residue i with residue j.

We want to calculate a score for the conservation of the jth position in the
alignment.

To do this, we define an R-dimensional space with each residue in the comparison
matrix assigned to an axis in the space. Each sequence in the alignment can
then be assigned a point S in the space. S has R dimensions, and for sequence i,
the rth dimension is defined as:

	Sr = M(r,Aij)     i.e. the score for aligning residue r with residue Aij

We then calculate a consensus point for the jth position in the alignment. This
point P also has R dimensions, and the rth dimension is defined as:

	Pr = (   SUM   (Fij * Mir) ) / m
               1<=i<=R

where Fij is the frequency of residue i at position j in the alignment.

Now we can calculate the distance D between each sequence i and the consensus 
point P in the R-dimensional space.

	Di = SQRT   (   SUM   (Pr - Sr)(Pr - Sr) )
                      1<=r<=R

where Sr is the rth dimension of the point S for sequence i.


The conservation score for the jth position in the alignment is calculated as
the mean of the sequence distances Di.

The score is normalised by multiplying by the percentage of sequences which
have residues (and not gaps) at this position.
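
The calculation for one column can be sketched as follows. This is a direct
transcription of the formulas above, not the Clustal X code. The function name
and the toy two-residue matrix are invented, and leaving gapped sequences out
of both the consensus point and the mean is an assumption. The value returned
is the mean distance itself, so a column of identical residues gives 0; how
this is converted to the plotted score is not covered here.

```python
import math


def column_quality(column, matrix, residues, gap="-"):
    """Mean distance of the sequences from the consensus point, gap-normalised."""
    present = [c for c in column if c != gap]
    if not present:
        return 0.0
    n = len(present)
    counts = {res: present.count(res) for res in residues}
    # Consensus point P: Pr = SUM over residues i of (Fij * Mir) / n
    consensus = [sum(counts[i] * matrix[i][r] for i in residues) / n
                 for r in residues]
    distances = []
    for a in present:
        point = [matrix[r][a] for r in residues]  # Sr = M(r, Aij)
        distances.append(math.sqrt(sum((p - s) ** 2
                                       for p, s in zip(consensus, point))))
    mean_distance = sum(distances) / n
    # Normalise by the fraction of sequences with a residue at this position.
    return mean_distance * n / len(column)


toy = {"A": {"A": 1.0, "C": 0.0}, "C": {"A": 0.0, "C": 1.0}}
print(column_quality("AAAA", toy, "AC"))  # 0.0, all points coincide
print(column_quality("AACC", toy, "AC"))  # each point is sqrt(0.5) from P
```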

Residue Exceptions
------------------

The jth residue of the ith sequence is considered as an exception if the
distance Di of the sequence from the consensus point P is greater than 
Inter Quartile Range * Cutoff * 0.5 from the Median of all sequence distances.
The value used as a cutoff for displaying exceptions can be set from the
PARAMETERS option. A high cutoff value will only display very significant
exceptions; a low value will allow more, less significant exceptions to
be highlighted.

(NB. Sequences which contain gaps at this position are not included in the
exception calculation.)
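
The test can be sketched as below, under stated assumptions: only distances
larger than the median are flagged, and the quartiles are computed with
Python's statistics.quantiles (default method), which may differ slightly from
the convention used by Clustal X. The function name and the example distances
are invented.

```python
import statistics


def residue_exceptions(distances, cutoff):
    """Indices of sequences whose distance is an exception in this column."""
    q1, median, q3 = statistics.quantiles(distances, n=4)
    threshold = (q3 - q1) * cutoff * 0.5  # Inter Quartile Range * Cutoff * 0.5
    return [i for i, d in enumerate(distances) if d - median > threshold]


# Distances Di for the sequences that have a residue at this position.
distances = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.0, 0.8]
print(residue_exceptions(distances, cutoff=5))  # [5]
print(residue_exceptions(distances, cutoff=1))  # [4, 5]
```

Lowering the cutoff from 5 to 1 adds a second, less significant exception, as
described above.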


ALIGNMENT QUALITY PARAMETERS
----------------------------

SCORE WEIGHT MATRIX: the scoring table which describes the similarity of 
each amino acid to every other.

For protein, there are three 'in-built' weight matrices offered: an identity
matrix  which gives a score of 10 to two identical amino acids and a score of
zero otherwise, the Blosum 45 matrix and the Gonnet PAM 250 matrix.

A new matrix can be read from a file on disk, if the filename consists only
of lower case characters. The values in the new weight matrix should be
similarities.  You can use negative as well as positive values if you wish,
although the matrix will be automatically adjusted to all positive scores.
 
INPUT FORMAT  The format used for a new matrix is the same as that used by the
BLAST program. Any lines beginning with a # character are assumed to be
comments. The first non-comment line should contain a list of amino acids in
any order, using the 1 letter code, followed by a * character. This should be
followed by a square matrix of scores, with one row and one column for each
amino acid. The last row and column of the matrix (corresponding to the *
character) contain the minimum score over the whole matrix.

For DNA, a single hard-coded matrix is used. This is the default scoring
matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and
N's are treated as matches to any IUB ambiguity symbol. All matches score 1.0;
all mismatches for IUB symbols score -0.9.
 

SCORE SCALE: this is a scalar value from 1 to 10, which can be used to change
the scale of the quality score plot. 


RESIDUE EXCEPTION CUTOFF: this is a scalar value from 1 to 10, which can be
used to change the number of residue exceptions which are highlighted.


>>HELP 9 <<      Help for command line parameters
                DATA (sequences)

/INFILE=file.ext                             :input sequences.
/PROFILE1=file.ext  and  /PROFILE2=file.ext  :profiles (old alignment).

                VERBS (do things)

/OPTIONS	    :list the command line parameters
/HELP  or /CHECK    :outline the command line params.
/ALIGN              :do full multiple alignment 
/TREE               :calculate NJ tree.
/BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).

                PARAMETERS (set things)

***General settings:****
/INTERACTIVE :read command line, then enter normal interactive menus
/QUICKTREE   :use FAST algorithm for the alignment guide tree
/NEWTREE=    :file for new guide tree
/USETREE=    :file for old guide tree
/NEGATIVE    :protein alignment with negative values in matrix
/OUTFILE=    :sequence alignment file name
/OUTPUT=     :GCG, GDE, PHYLIP or PIR
/OUTORDER=   :INPUT or ALIGNED
/CASE        :LOWER or UPPER (for GDE output only)

***Fast Pairwise Alignments:***
/KTUP=n      :word size                  /TOPDIAGS=n  :number of best diags.
/WINDOW=n    :window around best diags.  /PAIRGAP=n   :gap penalty
/SCORE       :PERCENT or ABSOLUTE

***Slow Pairwise Alignments:***
/PWMATRIX=   :BLOSUM, PAM, GONNET, ID or filename
/PWGAPOPEN=f :gap opening penalty        /PWGAPEXT=f  :gap extension penalty

***Multiple Alignments:***
/MATRIX=     :BLOSUM, PAM, GONNET, ID or filename
/GAPOPEN=f   :gap opening penalty        /GAPEXT=f  :gap extension penalty
/ENDGAPS     :no end gap separation pen. /GAPDIST=n   :gap separation pen. range
/NOPGAP      :residue-specific gaps off  /NOHGAP    :hydrophilic gaps off
/HGAPRESIDUES= :list hydrophilic res.    /MAXDIV=n    :% ident. for delay
/TYPE=       :PROTEIN or DNA             /TRANSITIONS :transitions NOT weighted.

***Profile Alignments:***
/PROFILE     :Merge two alignments by profile alignment
/SEQUENCES   :Sequentially add profile2 sequences to profile1 alignment

***Structure Alignments:***
/NOSECSTR1     :do not use secondary structure/gap penalty mask for profile 1 
/NOSECSTR2     :do not use secondary structure/gap penalty mask for profile 2
/SECSTROUT=    :STRUCTURE or MASK or BOTH or NONE  output in alignment file
/HELIXGAP=n    :gap penalty for helix core residues 
/STRANDGAP=n   :gap penalty for strand core residues
/LOOPGAP=n     :gap penalty for loop regions
/TERMINALGAP=n :gap penalty for structure termini
/HELIXENDIN=n  :number of residues inside helix to be treated as terminal
/HELIXENDOUT=n :number of residues outside helix to be treated as terminal
/STRANDENDIN=n :number of residues inside strand to be treated as terminal
/STRANDENDOUT=n:number of residues outside strand to be treated as terminal 

***Trees:***                             /SEED=n    :seed number for bootstraps.
/KIMURA      :use Kimura's correction.   /TOSSGAPS  :ignore positions with gaps.
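
As an illustration of how these parameters combine, the sketch below assembles
a command line from the pieces listed above. The helper function is
hypothetical, the program name "clustalx" is an assumption (use whatever name
the program has on your system), and the penalty values are arbitrary examples.

```python
def build_command(program, infile, verbs=(), **params):
    """Assemble a parameter list: /INFILE first, then verbs, then settings."""
    args = [program, "/INFILE=" + infile]
    args += ["/" + verb.upper() for verb in verbs]
    for key, value in params.items():
        args.append("/%s=%s" % (key.upper(), value))
    return args


cmd = build_command("clustalx", "file.ext", verbs=["ALIGN"],
                    output="PHYLIP", gapopen=10.0, gapext=0.05)
print(" ".join(cmd))
# clustalx /INFILE=file.ext /ALIGN /OUTPUT=PHYLIP /GAPOPEN=10.0 /GAPEXT=0.05
```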