Instructions for reticulate, a program for calculating Compatibility Matrices Copyright 1995 by Ingrid Jakobsen, Australian National University Permission is granted to use, copy, modify and distribute this program and accompanying documentation provided no fee is charged, and this copyright notice is not removed. No representations are made about the suitability of this software for any purpose. email ingrid@helios.anu.edu.au Note: this file includes ASCII diagrams. You will need to display/print it using a non-proportional font (Courier) to get full benefit of the diagrams. Contents ******** List of supplied files Advice on compiling the program successfully Modifying the program for systems that don't have X11 graphics How to run the program, including description of data input format Notes on the various options available Descriptions of the randomisation functions References Supplied Files ************** Instructions This file reticulate.c Main body of program XPlot.c X graphics functions XPlot.h Headers for X graphics Makefile Makefile gamglob.gde Gamma globin sequences aligned in gde flat file format Compilation of program ********************** You need the four files: reticulate.c XPlot.c XPlot.h Makefile. Edit the makefile to suit your local system, most importantly there must be a line (without the #) where CFLAGS points at the local X11 include files. If necessary, edit the #define statements at the beginning of reticulate.c. It may be required to set the papersize to suit the printer you intend to send output to. The program currently sets up the page to be the smaller of A4 and US letter in each direction, which should allow printing on most printers, but you may wish to set it to the exact local size. In addition, the margins the program will print out to are quite narrow, and may need to be widened. All measurements are in points, where 72 points = 1 inch. Unfortunately points do not convert nicely to cm, since 1 inch = 2.54 cm. Type make at the prompt. The compiled program is called reticulate. There may be unknown difficulties compiling this program on other systems, I only have experience with Sun OS 4.1.3 and Silicon Graphics IRIX. I have tried very hard to only use standard ANSI C. Compiling the program without X11 ********************************* You can compile and run a modified version of reticulate. Unfortunately, you won't be able to see the matrices on screen. You will only need reticulate.c Edit it, to remove the following lines: (In the headers to the program) #include #include #include "XPlot.h" (In the main body of the program, the interactive section) if (option == 'd') /* Display the matrix on screen */ { printf("\n\nThe matrix will be displayed in a separate window."); printf(" When you have examined\nit, click the mouse once in the "); printf("window to continue the program.\n"); RetCode = ShowXPlot(NumSeq, SeqLen, codonnum, codon_pos, site_matrix[into]); if (RetCode) printf("Unable to display matrix. \n"); } (edit the immediately following line to remove the "else") else if (option == 'p') /* postscript output */ (becomes:) if (option == 'p') /* postscript output */ (You can also remove this line for niceness:) printf(" (d) Display the current matrix on screen (X display)\n"); (You may also wish to change the definition of 'option' in main() , to whatever suits you, 'p' is probably best) char option = 'd'; (becomes:) char option = 'p'; The program should now compile with a standard C compiler, for example gcc reticulate.c -o reticulate Alternatively, you may have a local graphics package you can plug in. Instead of removing the above lines, you need to edit them to include the package, however it works. Basically the idea is to generate a square window, the size of the number of cladistically informative sites, and make all co-ordinates corresponding to the letter 'm' in the compatibility matrix black, the rest should be left white. See the ShowXPlot function in XPlot.c Running the program, data input formats *************************************** Type reticulate at the prompt. The program will ask for the name of the input file, followed by various options. The most common problem we encounter when running the program is to be in a different directory than the input data file. There should be at least 4 sequences to be analysed, and they should previously have been aligned. If the alignment isn't suitable for phylogenetic analysis, it's not suitable for a compatibility matrix. The format of the data should be FASTA or related formats, eg >Sequence1 ACTACGGGATAACAATT AGATACAGATACGAGAG AGTCTCTC >Sequence2 ACTACGCGATAATGATT AG--ACAGATAGCAGAG ?GTCCCTC >Sequence3 -----GGGATTACGATG AG--ACAG?TACCAGAG AGTCTCTT etc. Each sequence appears as a complete block, one after the other. The first line of each sequence has a marker character, followed by the sequence name and any other comments you like, but no actual sequence data. Only a fixed number of characters are read in and saved as the sequence name. This is currently set to 15, but can be altered in the source code. The marker character can be any non-alphanumeric character. For example > # ; % are all acceptable. The same character should be used to mark all sequences. It should be the first character on the name line. For example, check that there are no spaces in front of it. Avoid using ? or - as a marker character. They are used to mark unknowns and gaps in the sequences themselves. Gaps should be marked with '-' characters, as many as the length of the gap, to maintain the correct alignment. However, if any sequence is shorter than the others, the final gap does not need to be marked. It will be filled automatically. Unknown characters should be marked with a '?', or alternatively one of the characters 'N' (for DNA) or 'X' may be used. The program will ask you if any 'N' or 'X' characters it encounters are to be taken as unknown and if you reply yes, will code them internally as '?'. It is only possible to set one of 'N' or 'X' to be recorded as unknown. There can be any amount of whitespace in the sequences, it will be ignored. Make sure that for example no gaps in the alignment have been marked using spaces. Related formats which are also accepted include GDE flat file format: #Sequence1 ACTACGGGATAACAATT AGATACAGATACGAGAG AGTCTCTC #Sequence2 ACTACGCGATAATGATT AG--ACAGATAGCAGAG ?GTCCCTC #Sequence3(5) GGGATTACGATGAG--A CAG?TACCAGAGAGTCT CTT etc In GDE flat file format, a number in parentheses on the name line is interpreted as an initial number of gaps at the front of the sequence (i.e. an offset). The program asks if numbers in parentheses on the name line should be interpreted as initial gaps. Note that answering this 'no' when there are gaps marked in this way, will result in the wrong alignment being read in, and possibly vice versa if you happen to have any comments in brackets on the sequence name line. Calculating the matrix ********************** The program will read in the data file and then examine the file for cladistically informative sites. The options for dealing with sites with more than two characters (non-binary sites) are: to ignore those sites altogether; to leave them as multiple-character sites; to convert DNA into transversion sites, or to manually decide binary partitions for each such site in turn. In the manual option, the number of sequences with each character state are presented for each site. The options are to ignore the site, convert to binary, or to leave as is. When converting to a binary site, enter the characters that you would like grouped into the first of the two groups. As a check, the binary characters and counts are listed before the next site is presented. When the DNA transversion site option is invoked, non-binary sites are grouped into 'Y' (Y or C or T/U) and 'R' (R or A or G). If any other character is found, the program moves onto the manual option above to resolve the site. At this point, the informative sites can be saved into a text file, with the character states in the form the program used to determine compatibility - for example if the transversion option was set for DNA, non-binary sites will appear as 'Y' and 'R' characters rather than nucleotides. For determining sites and calculating compatibility, sequences with gaps or unknowns will be ignored. If for example a gap needs to be considered for determining compatibility, one of the gap characters should be replaced by a letter (or number) in all the sequences sharing the gap. Compatibility is calculated by comparing two informative sites. All distinct combinations of characters at the two sites are found. For example, for the five sequences below, the program had previously identified three informative sites (they are repeated to the right of the 2 6 8 2 6 8 alignment) then, when comparing sites 2 and 6, A ACGAACGTACGT C C T the pairs (C,C) (A,C) (A,T) are found, for B ACGTACGGACGT C C G 2 and 8 they are (C,T) (C,G), (A,T) (A,G) and C AAGTACGTACGT A C T for 6 and 8, (C,T) (C,G) (T,G). Note that the D AAGTATGGACGC A T G order within the pairs is important. To find out E AAGTATGGACGT A T G if two sites are compatible, any pairs that have a unique character in either position is eliminated. So for sites 2 and 6 we can eliminate (C,C) since no other pair has a C in the first position, and (A,T) since no other pair has a T in the second position. The process is repeated and (A,C) can now be eliminated, since it is the only pair left. Since all pairs are eliminated sites 2 and 6 are compatible. On the other hand, for sites 2 and 8, we cannot eliminate any pair, as both C and A in the first position occurs in two pairs, and similarly T and G in the second position. This means sites 2 and 8 are incompatible. For sites 6 and 8, (C,T) and (T,G) can be eliminated - the T is unique in each position. Thus 6 and 8 are also compatible. 2 6 8 The comparisons of sites are displayed in a matrix format. 2 \ . M The compatible comparisons are coloured white, the incompatible 6 . \ . are black. On the left, white squares are represented by . 8 M . \ and black by M . The diagonal (self with self) is \ . The compatibility matrix is calculated as the pairwise compatibility of all informative sites and displayed in a separate X11 window on the screen. Click the mouse anywhere in the matrix window to continue the program. Options ******* The program ends with an interactive options menu. The options are to display the matrix (using X11), save the matrix in postscript for printing, and a randomisation option which allows you to find some statistics about the matrix, and the significance of any clustering you may have observed. The random matrix option calculates the matrix's neighbour similarity score and compatibility of subregions of the matrix if desired. Random matrices are then calculated, as many as desired, provided memory can be allocated. The summary statistics for the random matrices are printed in a text file. The final random matrix is stored. In the main options menu, it is possible to change "current matrix" to the random matrix, and so display or save the random matrix. Obviously, by calling the randomisation option several times, you can view a number of different random matrices of the original data. However, the randomisation is done based on the "current matrix", so it is necessary to return to the original matrix before running the randomisation option again, if you would like the statistics to make any sense. Description of randomisation algorithm and Neighbour similarity score. ********************************************************************** The neighbour similarity score is calculated by examining each internal edge in the (triangular half) matrix and scoring those edges separating squares of the same colour. Thus most squares are compared to four other squares, while those on the edges and corners are only compared to three, two respectively. The diagonal entries are included and treated as compatible. However since this convention is maintained when scoring the random matrices, it should be of little practical importance. The compatibility score of the overall matrix or of subregions about the diagonal is the fraction of (off-diagonal triangle) entries that are compatible in that region. Thus the compatibility of a region with only one site is undefined, two sites have only one entry, three sites have three, four sites have six, etc. Again the value is calculated identically for the observed and random matrices. The matrix is randomised by shuffling the sites rather than the individual squares. It is a shuffle rather than a bootstrap - each original site appears once in each random matrix. The randomisation by sites ensures that the individual character of sites is retained, for example hypermutating sites remain as "black" rows and columns in the matrix. Preliminary testing had shown that randomisation of squares rather than sites made matrices appear much more statistically significant - the test was of the degree of compatibility of individual sites rather than their order along the sequence. Notes on pairwise comparisons and non-binary sites ************************************************** The current program only calculates compatibility for two sites at a time, however it is possible to do compatibility for multiple sites at once. It wasn't implemented here as it is difficult to display and interpret multi- dimensional matrices. As you would have noticed in the example, it is possible for site A to be compatible with B and B with C but that does not mean A is compatible with C. Somewhat similarly it is possible for A to be compatible with B, B with C and A with C, but all three sites are *not* compatible when considered as a triplet. This problem however only occurs when sites have more than two distinct character states. This is why the program allows sites with more character states to be ignored altogether, or converted into binary sites. If you are worried about a group of sites (all of which have more than two character states) being pairwise compatible but not overall compatible, the algorithm for determining compatibility can be extended to triplets etc. It gets progressively more messy however. "Sets" (the generic of the pairs) are not eliminated when unique in one position, but in some cases replaced by new sets. For the above example (which we already know to be incompatible since 2 and 8 are incompatible) the distinct triplets are (C,C,T) (C,C,G) (A,C,T) (A,T,G). In (A,T,G), T is unique in the middle position. The triplet is replaced by any (A,*,G) we can infer from the other sets. (A,C,T) matches in the first position generating (A,C,G) and similarly (C,C,G) matches in the last position, also generating (A,C,G). However, only one is included as each distinct set is only represented once, as before. Now (C,C,T) (C,C,G) (A,C,T) and (A,C,G) are left and the three sites are incompatible since every character in every position appears in at least one set. The fact that the second position (site 6) is now only C, confirms that the incompatibility in this triplet is really between 2 and 8. The pairwise algorithm is really a special case of the description above, as the pair(s) that replaces an eliminated pair is already present in the list. References ********** Sneath, P.H.A., Sackin,M.J. & Ambler,R.P. Detecting evolutionary incompatibilities from protein sequences Syst. Zool. 24: 311-332 (1975) Jakobsen, I.B. & Easteal, S. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences submitted to CABIOS (1995) **********************************************************