PREPARING DATA FOR MAPMAKER VERSION 3.0 AND MAPMAKER/QTL VERSION 1.1
(c) Copyright 1992 Whitehead Institute for Biomedical Research

Data files are fully compatible between MAPMAKER Version 3.0 and MAPMAKER/QTL 
version 1.1, providing a unified package for mapping both genetic markers as 
well as factors controlling quantitative traits in the same populations. 
MAPMAKER Version 3.0 can analyze data derived from progeny of several types of 
crosses, including:

    F2 backcross (e.g. BC1)
    F2 intercross
    F3 intercross (by self-mating)
    Recombinant Inbred Lines (by self or sib-mating)

There are also other types of crosses for which MAPMAKER and MAPMAKER/QTL can 
be used, because the genetic model for the cross is identical to that of one of 
the above simple crosses. For example, F2 testcrosses and F1 haploid data can 
be used, as described below.

Unlike MAPMAKER however, MAPMAKER/QTL can currently only work with F2 
intercross and backcross data.  The two programs handle loading and preparing 
data files in similar ways, and share files which hold intermediate results. 

To get your data into MAPMAKER and MAPMAKER/QTL, the data must first be placed 
into a 'raw' file in an appropriate format. You can either maintain your data 
in this format, or instead extract it from your working database (such as a 
spreadsheet program). MAPMAKER (not MAPMAKER/QTL) must then be used to 
'prepare' these files into a processed form ready for analysis. These processed 
files can then be loaded by either MAPMAKER or MAPMAKER/QTL.  These issues are 
the topic of the next sections.


SETTING UP A RAW DATA FILE

Raw files are flat ASCII text files which may be generated in many ways, 
including: (i) any simple text editor, such as DOS Edit, A/UX's Text Editor, or 
Sun's OpenWindows Text Editor; (ii) a word-processor which can export text-only 
files; (iii) a spreadsheet or flat-file database which can export "Text Only" 
files, such as Excel, Lotus 1-2-3, or FileMaker; or (iv) a program which you 
write yourself. The raw data do not need to be stored on the same machine on 
which you run MAPMAKER, although you will obviously need some way of 
transferring the data. (Bear in mind that text-only formats differ slightly 
between Unix, DOS, and Macintosh -- your software should convert the file 
appropriately as it is transferred. Ask your computer support people for 
details.)
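
If your transfer software does not convert line endings for you, a few lines 
of script can do it. The following is a minimal sketch in Python (not part of 
MAPMAKER), assuming the file is headed for a Unix machine; the function name is 
hypothetical.

```python
def normalize_line_endings(raw_bytes: bytes) -> bytes:
    """Convert DOS (CR LF) and classic Macintosh (CR) line endings to Unix (LF)."""
    # Replace CR LF first so that a DOS line ending does not become two LFs.
    return raw_bytes.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Read the raw file in binary mode, pass its contents through this function, and 
write the result back out in binary mode.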

As a general note, MAPMAKER attempts to be very lenient about how you separate 
items in a data file (e.g. spaces, tabs, or sometimes line breaks), and is 
generally insensitive to extra spaces, uppercase-lowercase distinctions, and 
(after the top two lines) blank lines.  However, it is still possible to format 
a file in such a way that it confuses MAPMAKER -- if you have trouble, try to 
make your raw file look more like the included sample file.

The very first line of your raw data file should read like:

    data type xxxx

where  xxxx is one of the allowed data types, either:

    f2 intercross
    f2 backcross
    f3 self
    ri self
    ri sib

The second line of the raw file should contain a list of three numbers, 
separated by spaces, such as:

    46 362 2

The first of these values indicates the number of progeny for which data are 
included in the file (in this case, 46). The second indicates the number of 
genetic loci for which data are supplied (362). The third indicates the number 
of quantitative traits in the data set (here 2, although this may be zero, of 
course).

Additional information may be optionally supplied at the end of this line. In 
particular, you may specify the coding scheme you use for genotypes. By 
default, the codes used for F2 backcross (a.k.a. BC1) data are:

    'A'    Homozygote for the recurrent parent genotype.
    'H'    Heterozygote.
    '-'    Missing data for the individual at this locus.

For F2 intercross data, the default codes are:

    'A'    Homozygote for the allele from parental strain a of this locus.
    'B'    Homozygote for the allele from parental strain b of this locus.
    'H'    Heterozygote carrying both alleles a and b.
    'C'    Not a homozygote for allele a (either bb or ab genotype).
    'D'    Not a homozygote for allele b (either aa or ab genotype).
    '-'    Missing data for the individual at this locus

For RI data, the default codes are:

    'A'    Homozygote for parental genotype a.
    'B'    Homozygote for parental genotype b.
    '-'    Missing data for the individual (or line) at this locus.

Also by default, MAPMAKER will match genotype characters in a case-insensitive 
manner (that is 'a' and 'A' indicate the same genotypes).

However, you can tell MAPMAKER to use whatever conventions you like, so long as 
you use the same conventions for the entire data file. First off, if you follow 
the numbers on the second line with the word "case", then MAPMAKER will match 
genotype characters in a case sensitive manner (that is 'a' and 'A' can be used 
to indicate different genotypes). For example:

    46 362 2 case

If you do not wish to use case-sensitive genotypes, do not include the word 
"case".

To specify the coding scheme itself, include on the end of the above line the 
word "symbols" followed by the coding scheme you wish to use, defined in terms 
of the coding scheme above. For example, if you wish to use the following 
scheme with an RI data set:

    '1'    Homozygote for parental genotype a.
    '2'    Homozygote for parental genotype b.
    '0'    Missing data for the individual (or line) at this locus.

then you would use a second line like:

    46 362 2 symbols 1=A 2=B 0=-

Note that when interpreting this line, MAPMAKER is in fact quite finicky about 
spaces and case distinctions (in order to keep MAPMAKER from ever 
misunderstanding exactly what you mean). In particular, NO SPACES should 
surround the "=" signs.

To use the following scheme with a backcross data set:

    'a'    Homozygote for parental genotype a.
    'A'    Heterozygote.
    '-'    Missing data for the individual (or line) at this locus.

you should use a line like:

    46 362 2 case symbols a=A A=H

The main restriction on coding schemes is that the only allowed symbols are 
letters, numbers, and the characters '-' and '+'. 
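
To check a header line before handing the file to MAPMAKER, the rules above 
can be sketched in Python. This is only an illustration of the format as 
described here, not MAPMAKER's own parser; the function name and return layout 
are hypothetical.

```python
def parse_header_line(line):
    """Split the second raw-file line into counts, case flag, and symbol map."""
    fields = line.split()
    n_individuals, n_loci, n_traits = (int(f) for f in fields[:3])
    case_sensitive = False
    symbols = {}          # maps your code -> default code, e.g. {"1": "A"}
    in_symbols = False
    for field in fields[3:]:
        if field == "case":
            case_sensitive = True
        elif field == "symbols":
            in_symbols = True
        elif in_symbols and "=" in field:
            # No spaces may surround "=", so each definition is one field.
            user_code, default_code = field.split("=", 1)
            symbols[user_code] = default_code
        else:
            raise ValueError(f"unexpected item on header line: {field!r}")
    return n_individuals, n_loci, n_traits, case_sensitive, symbols
```

For the line "46 362 2 case symbols a=A A=H" this yields the three counts, a 
true case flag, and a mapping of 'a' to 'A' and 'A' to 'H'.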

After the first two header lines, the raw file should then present the genetic 
locus data, in the following simple format:  For each locus, you list (1) the 
name of the locus, preceded by an asterisk ("*"); (2) one or more spaces (or 
tabs etc.); and (3) the genotypic data for all individuals, in order. For 
example:

    *locus1   BA-HHHAAABBB-HHAA

would provide data for a locus named "locus1" with individual #1 having the B 
genotype, individual #2 having the A genotype, and so forth. Data for each new 
locus should begin on a new line (with blank lines allowed), although the 
genetic data for any one locus may be "broken" by any number of spaces, tabs, 
and line breaks. This means that, among other things, tab-delimited-text files 
(such as those often exported by spreadsheet programs) will work well, for 
example:

    *L2  B    A    -    H    H    H    A    A    A    B    B    B    -    H

There is a system-dependent maximum line length, although it is fairly large 
(at least 1,000 characters, where a tab counts as one character).
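
If you generate raw files with your own program, long genotype strings can be 
broken across lines to stay well under the line-length limit. A minimal Python 
sketch follows; the function name, the default width of 60, and the 
four-space continuation indent are arbitrary choices, not MAPMAKER 
requirements.

```python
def format_locus(name, genotypes, width=60):
    """Return raw-file text for one locus, wrapping the genotype string."""
    if not genotypes:
        raise ValueError("no genotype data for locus " + name)
    # Cut the genotype string into pieces of at most `width` characters.
    chunks = [genotypes[i:i + width] for i in range(0, len(genotypes), width)]
    lines = [f"*{name}  {chunks[0]}"]
    # Continuation lines carry only genotype data, so they do not start with '*'.
    lines.extend("    " + chunk for chunk in chunks[1:])
    return "\n".join(lines)
```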

Locus names should be kept to at most 8 characters, and must be limited to 
alphabetic and numeric characters, along with the underscore character ('_') 
and periods ('.'). No other characters are allowed (although any dashes in 
locus names ('-') will be converted to underscores). Locus names must start 
with an alphabetic character (so that they are not confused with locus numbers 
in MAPMAKER sequences).
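
These naming rules are easy to check before preparing a file. The sketch below 
is a hypothetical Python helper, not part of MAPMAKER; it assumes you want to 
treat names longer than 8 characters as errors, and it mirrors the 
dash-to-underscore conversion described above.

```python
import re

# A letter first, then letters, digits, underscores, or periods.
_LOCUS_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.]*")

def clean_locus_name(name):
    """Return the name with dashes converted, or raise ValueError if invalid."""
    name = name.replace("-", "_")
    if len(name) > 8:
        raise ValueError(f"locus name longer than 8 characters: {name!r}")
    if not _LOCUS_NAME.fullmatch(name):
        raise ValueError(f"invalid locus name: {name!r}")
    return name
```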

Any quantitative trait data should come after the genetic locus data. These 
data follow a similar format, except that the trait values for each individual 
must be separated by at least one space, tab, or line break. A dash ('-') alone 
indicates missing data. For example:

    *weight    6.3 7.7 8.0 6.2 8.6 - 7.5 9.0 5.5 - - 8.4 7.7 7.4 6.9 -

would correspond to a trait named "weight", for which individual #1 has a value 
of 6.3, individual #2 has a value of 7.7, and so on. The sixth individual is 
missing data for this trait (and will be ignored for all analyses involving 
these trait data). As for the genotypes, a new trait should begin on a new 
line, and line breaks are allowed. Tab-delimited-text files work well here too.

Traits may also be specified as functions of other existing trait data. For 
example:

    *weight1    6.3  7.7  8.0  6.2  8.6  6.9  7.5  9.0 
    *weight2    6.7  7.9  7.5  6.8  8.0  7.3  7.5  9.5 
    *mean= (weight1 + weight2)/2

The format of these equations is described under the "make trait" command. Such 
traits must be included in the number of traits indicated on the file's second 
line.

Note that genetic maps (particularly for MAPMAKER/QTL) are no longer included 
in the raw file, as they were with MAPMAKER Version 2.0. Instead, use a ".prep" 
initialization file, described below.

Finally, note that comments may be inserted on any line starting with a number 
sign character ("#"). 

An example of a complete raw file follows:

    data type f2 intercross
    20 5 2
    # Joe's tiny data set, 10/21 version.

    *locus1  BBBHH-AAABBBHHH-AABA
    *locus2  AB-ABHABHAB-ABHABHBH
    *locus3  ABBAHHHBHABHABHBBHH-
    # Locus3 may be mis-scored in individual 12!
    *locus4  ABHABAAAHAB-ABHABHHB
    *locus5  ABHABHAA-ABHABHAHHHB

    *trait1 6.3 7.7 8.0 6.2 8.8 6.2 4.1 6.5 5.4 7.3
            8.7 9.0 5.2 6.8 7.2 7.1 7.6 8.3 8.1 7.5
    *trait2 5.5 5.5 5.5 4.5 4.5 4.5 3.5 3.5 3.5  - 
            5.5 5.5 4.5 4.5 4.5 3.5 5.2 6.8 7.2 7.1
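
Before preparing a file, it can help to confirm that the counts on the second 
line agree with the data that follow. The Python sketch below is a rough 
consistency check, not a reimplementation of MAPMAKER's reader: it assumes 
single-character genotype codes, does not handle traits defined by equations, 
and does not validate the genotype symbols themselves. The function name is 
hypothetical.

```python
def check_raw_file(text):
    """Parse raw-file text; return (data_type, loci, traits) or raise ValueError."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("raw file needs two header lines")
    words = lines[0].lower().split()
    if words[:2] != ["data", "type"]:
        raise ValueError("first line must read 'data type xxxx'")
    data_type = " ".join(words[2:])
    n_ind, n_loci, n_traits = (int(x) for x in lines[1].split()[:3])

    # One record per '*' line; later lines without '*' continue that record.
    records = []
    for line in lines[2:]:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        if stripped.startswith("*"):
            parts = stripped[1:].split(None, 1)
            records.append((parts[0], parts[1].split() if len(parts) > 1 else []))
        elif records:
            records[-1][1].extend(stripped.split())
        else:
            raise ValueError("data found before the first '*' record")

    if len(records) != n_loci + n_traits:
        raise ValueError(f"expected {n_loci + n_traits} records, found {len(records)}")
    loci, traits = {}, {}
    for name, items in records[:n_loci]:
        genotypes = "".join(items)  # separators inside genotype data are ignored
        if len(genotypes) != n_ind:
            raise ValueError(f"locus {name}: {len(genotypes)} genotypes, expected {n_ind}")
        loci[name] = genotypes
    for name, items in records[n_loci:]:
        if len(items) != n_ind:
            raise ValueError(f"trait {name}: {len(items)} values, expected {n_ind}")
        traits[name] = [None if v == "-" else float(v) for v in items]
    return data_type, loci, traits
```

Missing trait values (a lone dash) come back as None, so they are easy to tell 
apart from real measurements.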


PREPARING A RAW DATA FILE FOR ANALYSIS

Once your data are in the raw file format, it is easy to process them into a 
form usable by MAPMAKER Version 3.0 and MAPMAKER/QTL 1.1.  In this version of 
the programs, you must do this processing using MAPMAKER's "prepare data" 
command (you cannot presently prepare a raw file using MAPMAKER/QTL). 

Simply put, the "prepare data" command loads the information in your raw data 
file into MAPMAKER. Unless told otherwise (see below), MAPMAKER then writes 
some new files which are in a slightly different format (you should not ever 
modify these files, and thus you should not be concerned about precisely what 
this format is.)  Your raw file remains unaltered and should be saved as a 
backup copy of your data. These new files will serve as the working data set 
for MAPMAKER and MAPMAKER/QTL -- both programs will read and write these files 
repeatedly to keep the state of your analyses between sessions.

In the process of preparing data, MAPMAKER loads the new data set into its 
memory, which is then ready for analysis (earlier versions of MAPMAKER required 
you to separately load a data file after it was prepared; this is no longer the 
case).

The first files generated get the extensions ".data", ".maps", and ".traits" 
(truncated on DOS systems to ".dat", ".map", and ".tra"). The ".data" file 
contains the genetic locus data. The ".maps" file contains saved mapping 
results along with some MAPMAKER specific information.  The ".traits" file 
contains the quantitative trait data and several MAPMAKER/QTL specific values. 
Other files may also be created while you use MAPMAKER and MAPMAKER/QTL -- 
these include ".2pt" and ".3pt" files containing MAPMAKER's two-point and 
three-point data respectively, and a ".qtls" file (".qtl" on DOS) containing 
saved results from MAPMAKER/QTL.

To prepare a raw file, simply start up MAPMAKER, and type the command:

    prepare data xxxx

where xxxx is the name of the raw file (with its extension, if it has one). We 
recommend that raw files use the extension ".raw", although this is not 
required. For example:

    prepare data mydata.raw

If you specify a directory for the file name, the prepared files will be placed 
in that directory also. 

You may now start analyzing your data using any of MAPMAKER's commands. When 
you later quit MAPMAKER (or use the "save" command), the files will be updated.  
Later, you may resume your analyses by restarting MAPMAKER and re-loading these 
files using the "load data" command. For example:

    load data mydata


USING AN INITIALIZATION (.PREP) FILE

Whenever you issue the "prepare data" command, MAPMAKER looks for a file with 
the same name as the raw data file and the extension ".prep" (".pre" on DOS). 
If this file is present, it is assumed to contain 
MAPMAKER commands, which are automatically executed after the data are 
prepared. These "initialization files" serve as a useful way to set up MAPMAKER 
in the appropriate state for working with a particular data set. With an 
initialization file, every time that data set is prepared (e.g. if you change 
genotype data), it is relatively easy to start again where you left off. 

When an initialization file is not found, MAPMAKER's default initialization 
action is simply to save the working data files (as if the "save data" command 
had been typed).  When an initialization file is found, MAPMAKER executes its 
commands INSTEAD.  Thus, if you want MAPMAKER to save the files, you should end 
your initialization file with a "save data" command.

Typical actions in an initialization file might be to:
    - set various MAPMAKER options or parameters
    - declare the names of chromosomes, classes, anchor loci, etc
    - set the framework orders of chromosomes, particularly for MAPMAKER/QTL
    - precompute two-point data and find linkage groups
    - set various named sequences

To load a data set into MAPMAKER/QTL, you need to provide "framework" maps for 
any chromosome you wish to scan. When you know a map order for some 
chromosomes, it is often convenient to place this in an initialization file in 
order to quickly have a data set ready for MAPMAKER/QTL. 

If you wish MAPMAKER to calculate the map distances, you can do this with 
commands like:

    make chromosome chrom2
    sequence R45S TG165 TG175 CD35 TG93 CD66 TG50B
    framework chrom2

To provide map distances yourself, use a sequence with fixed distances using 
MAPMAKER's "=" syntax:

    seq R45S =21.9 TG165 =20.7 TG175 =4.4 CD35 =13.2 TG93 =7.3 CD66 =13.6 TG50B

See the discussion of the "sequence" command in the MAPMAKER reference manual 
for details. Note that the above map distances would be assumed to be in 
centimorgans, using the specified "centimorgan function" (by default, the 
Haldane function). Naturally, you do not NEED to declare the map orders in an 
initialization file to use MAPMAKER/QTL -- you may issue the same commands 
interactively before saving the data and then run MAPMAKER/QTL.
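
If your marker order and distances live in a spreadsheet or database, the 
fixed-distance sequence line can be generated rather than typed. The following 
Python sketch simply builds the "=" syntax shown earlier; the function name is 
hypothetical and the distances are taken to be in centimorgans.

```python
def fixed_distance_sequence(markers, distances_cm):
    """Build a 'seq' command with a fixed distance between adjacent markers."""
    if len(distances_cm) != len(markers) - 1:
        raise ValueError("need exactly one distance per adjacent marker pair")
    parts = [markers[0]]
    for marker, dist in zip(markers[1:], distances_cm):
        parts.append(f"={dist}")   # distance to the marker that follows
        parts.append(marker)
    return "seq " + " ".join(parts)
```

Called with the markers R45S, TG165, and TG175 and the distances 21.9 and 
20.7, it returns "seq R45S =21.9 TG165 =20.7 TG175".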

A sample ".prep" file might be:

    units cm
    cent func kosambi
    make chrom chrom1 chrom2 chrom3
    seq 1
    anchor chrom1
    seq 4 
    anchor chrom2
    seq 13
    anchor chrom3
    error det on
    seq all
    error prob 0.5
    two point
    assign
    seq R45S TG165 TG175 CD35 TG93 CD66 TG50B
    frame chrom2
    save

(note the use of command abbreviations here). Another example of a ".prep" 
file is supplied with the sample data files included with MAPMAKER.


USING OTHER TYPES OF CROSSES AND MARKERS

MAPMAKER's linkage analysis mechanism is quite general, and in fact can analyze 
many varied sorts of data.

For example, one frequently asked question concerns multibanded markers, such 
as cDNA RFLPs and RAPDs, particularly in an F2 intercross. In this case, each 
band of the marker can be considered a dominant trait, and can be entered using 
the C and D notation described above. However, some of the bands may be 
allelic, in which case you would gain much power by recoding them as a 
codominant (A/B/H) marker.  This can be done in two ways: (1) enter each 
band as a +/- marker, and perform an initial linkage analysis looking for 
markers that are recombinationally unseparated and which map together, then 
recode these as a codominant locus; or (2) you may be able to use 
MAPMAKER's "join haplotypes" feature, discussed in the reference manual.
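
Once two bands have been judged allelic, the recoding step of approach (1) is 
mechanical. The Python sketch below is one possible way to do it, under stated 
assumptions: each band is scored as a string of '+' (present) and '-' (absent), 
any other character means the score is missing, band_a comes from parental 
strain a and band_b from strain b, and an individual showing neither band is 
treated as missing because that pattern is inconsistent with allelism. The 
function name is hypothetical.

```python
def merge_allelic_bands(band_a, band_b):
    """Combine two dominant band scorings into one codominant A/B/H string."""
    if len(band_a) != len(band_b):
        raise ValueError("both bands must be scored on the same individuals")
    codes = {
        ("+", "-"): "A",   # only the strain-a band: homozygous a
        ("-", "+"): "B",   # only the strain-b band: homozygous b
        ("+", "+"): "H",   # both bands: heterozygote
    }
    # Any other combination (neither band, or a missing score) becomes '-'.
    return "".join(codes.get(pair, "-") for pair in zip(band_a, band_b))
```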

To enter data for other types of crosses, you need to determine whether the 
cross genetically resembles one MAPMAKER already understands, in terms of the 
underlying genetic model, or whether one of MAPMAKER's models will provide a 
reasonable interpretation (modulo some scaling of likelihoods and distances). 

As a simple example, consider an F2 testcross, which is much like a backcross 
except that we have:

    (a|a x b|b)  x  c|c

in which case the observable F2 genotypes are a|c and b|c. To code this as a 
backcross, simply designate one parent's genotype (a or b) as 'A', the other as 
'H', and enter the data with this coding in the normal way.  MAPMAKER's 
underlying genetic model will be exactly correct and the LOD scores and 
distances will be correct. Be careful however with +/- markers (such as RAPDs) 
to get the parental genotype assignments (a allele vs. b allele) correct!

As another example, imagine F1 haploids of an outbred species, again encoding 
the data as a simple backcross. For example, if we cross:

    a|b  x  c|d

then the observable haploid genotypes at any locus are: a, b, c, and d.  If 
linkage phase is known (that is if we know which chromosome a and b are on, and 
which c and d are on, and we can keep this assignment consistent across the 
entire data set), then the case is easy: Arbitrarily designate one backcross 
class (say 'A') as "a or c", the other ('H') as "b or d", and enter the data 
with this coding in the normal way -- MAPMAKER's underlying genetic model again 
will be exactly correct and the LOD scores and distances will be correct. 
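
For the phase-known case, the recoding is a simple lookup. A minimal Python 
sketch follows, assuming the haploid genotypes are recorded as the letters a, 
b, c, and d and that missing observations are recorded as None; the names used 
are hypothetical.

```python
# 'A' stands for "a or c", 'H' for "b or d", as in the designation above.
HAPLOID_TO_BACKCROSS = {"a": "A", "c": "A", "b": "H", "d": "H"}

def recode_haploids(observed):
    """Turn a list of haploid genotypes into a backcross genotype string."""
    # An unrecognized genotype raises KeyError rather than being guessed at.
    return "".join("-" if g is None else HAPLOID_TO_BACKCROSS[g] for g in observed)
```

The resulting string can be placed after the locus name in a raw file whose 
data type is "f2 backcross".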

Problems arise when true genotypic classes cannot be distinguished, or 
(equivalently) when linkage-phase is not known beforehand, as may be the case 
with RAPDs and similar markers. In such cases, your only recourse may be to 
perform a segregation analysis on the observed genotypes to determine probable 
phase assignments, and then code the data as phase known.  Other methods may be 
available: contact us for details.


COMPATIBILITY WITH PREVIOUS VERSIONS OF MAPMAKER

Users of MAPMAKER version 2.0 (a.k.a. 1.9) will have little trouble getting 
their data into MAPMAKER version 3.0, because the file formats are virtually 
identical. The only slight difference is in the format of the second line of 
the data file header, as described above.

Users of MAPMAKER/QTL version 1.0 (a.k.a. 0.9) however, will have to slightly 
modify their files in the way that chromosome orders and maps are included. The 
format described above makes this very convenient for the majority of users who 
will compute maps in MAPMAKER and then load these results into MAPMAKER/QTL.

Users of MAPMAKER Version 1.0, or MAPMAKER for Macintosh (a.k.a. MAPMAKER-II) 
will need to do a little more work, because of both the slightly different 
header and the required asterisks before locus names, as described above. F2 
backcross data sets, entered into old versions of MAPMAKER as intercrosses, 
should in fact be analyzed as true backcrosses in the new version (luckily, 
MAPMAKER 3.0's ability to use arbitrary genotype coding schemes, described 
above, ensures that you will NOT have to retype all of your genotype data into 
MAPMAKER.)


Ver 3b: S. Lincoln 12/92