Mase(File Formats)      UNIX Programmer's Manual      Mase(File Formats)

NAME
     MASE File Formats: MATRIX, CHARACTER-SET, OUTPUT STRINGS, and
     SEQUENCE DATA FILES.

SYNOPSIS
     The ``data'' that MASE works on must be stored in a file.
     Additionally, to maintain flexibility, MASE reads most of its
     behavioral characteristics from files external to MASE itself.
     This allows one version of the program to behave in different
     ways for different people, and allows individuals to easily
     ``tune'' MASE to best suit their needs.

     These data need to have some structure; this document defines
     that structure. In general, if one desires to use a structure
     other than those defined here, or would find an extension
     useful, only a single routine need be modified (reading and
     writing in a different format may well require modifying two
     different routines).

DESCRIPTION
     SEQUENCE DATA FILES
          The sequence data files use a simplified IntelliGenetics
          standard format. Compared to GenBank format, this is
          oversimplified, and obscures some significant information.
          It is, however, simple and congenial to importing and
          exporting data, and to interfacing to word processors. The
          sequence information is maintained in its entirety; this
          is, after all, our main goal.

          Each data file contains one or more sequences. Each
          sequence begins with an indefinite number of comment lines,
          each containing a semi-colon (``;'') in the first column.
          The first line NOT having a semi-colon in column one
          contains LOCUS NAME, a textual label for the locus. For
          compatibility with other software, the locus name should
          contain ten or fewer alphanumeric characters (most of our
          software will tolerate up to twenty characters in the locus
          name, and will tolerate blanks, tabs . . .).

          Following the locus name is the sequence information.
          This consists of one or more lines of sequence characters.
          Whitespace characters will be stripped from the input.
          There should be no more than 95 characters per line. On
          output, the sequence data will be formatted into lines of
          SEQUENCE-LINE-LENGTH characters each, with the white space
          discarded. The sequence information is terminated by the
          beginning of the comment block of the next sequence, or by
          the end of the file.

          MASE provides the ability to work concurrently with
          sequences from several files; on saving, the sequences will
          be written back to the file from which they came
          (FILE-MOVE is provided to move a locus into another file).

          MASE is concerned principally with editing the sequence
          information. Provision is made to edit the comment lines.
          Currently, there is no way to change the locus name within
          MASE; this is likely to change.

          See also: LOAD and SAVE

     SIMILARITY MATRIX FILES
          The format defined for the matrix file is weak, and will
          likely be rethought or at least greatly expanded. This
          description was valid when written.

          Lines containing a semi-colon (``;'') in column 1 will be
          ignored, and may be used to contain comments.

          Each entry line contains a definition for a set of pairs.
          This definition consists of three fields, separated by
          spaces or tabs. The first field is a floating point number,
          which defines the score value for all pairwise comparisons
          that are to be defined. The second field defines character
          set one; the third, character set two. The pairwise
          comparison of each member of set one with each member of
          set two is set to the score from field one.

          There are two special cases. A star (``*'') is taken to
          mean ``all characters''. If an ampersand (``&'') composes
          character set two (the third field), then it is taken to
          mean each of the members in the first set compared to
          THEMSELVES.
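
          The entry rules above can be sketched in a few lines of
          Python. This is only an illustration, not the MASE
          implementation: the function name parse_matrix and its
          alphabet parameter (the characters that ``*'' expands to)
          are assumptions, as is the choice to skip malformed lines
          silently and to store ordered pairs without making the
          table symmetric. Entries are applied in file order, so a
          later entry replaces an earlier score for the same pair.

```python
from itertools import product


def parse_matrix(lines, alphabet):
    """Build a {(a, b): score} table from similarity-matrix entry lines.

    `alphabet` is a hypothetical parameter: the characters "*" stands for.
    """
    scores = {}
    for raw in lines:
        if raw.startswith(";"):        # semi-colon in column 1: comment
            continue
        fields = raw.split()           # fields separated by spaces or tabs
        if len(fields) != 3:           # assumption: skip blank/malformed lines
            continue
        score = float(fields[0])
        set_one = alphabet if fields[1] == "*" else fields[1]
        if fields[2] == "&":
            # each member of set one compared to itself
            pairs = [(c, c) for c in set_one]
        else:
            set_two = alphabet if fields[2] == "*" else fields[2]
            pairs = product(set_one, set_two)
        for pair in pairs:
            scores[pair] = score       # later entries overwrite earlier ones
    return scores


entries = [
    "; mismatch score first, then refinements",
    "-1.0 * *",
    "1.0 IVL IVL",
    "3.0 IVL &",
]
scores = parse_matrix(entries, alphabet="IVL-")
print(scores[("I", "V")])   # 1.0
print(scores[("I", "I")])   # 3.0
print(scores[("I", "-")])   # -1.0
```

          Taking the alphabet as an argument keeps the sketch
          independent of whichever character set a given matrix file
          is meant for.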

          All lines in this file are interpreted sequentially; early
          entries may be partially overwritten by later entries. By
          example:

               -1.0   *     *

          would set ALL comparisons to have a score of ``-1.0''; this
          will define our mismatch score.

               1.0    IVL   IVL

          would make IxI, IxV, IxL, VxI, VxV, VxL, LxI, LxV, and LxL
          all have a score of 1.0.

               3.0    IVL   &

          would make IxI, VxV, and LxL have a score of 3.0.

               0.0    -     *

          would cause a gap across from anything to have a score of
          0.0. Be especially cautious about the order of the
          assignments.

          See also: SIMILARITY

     CHARACTER DISPLAY SET FILES
          MASE allows one to use an aliasing character set to display
          sequence information. This affects only the display;
          specifically, it does not affect SAVE, OUTPUT, SEARCH,
          SEARCH-AGAIN, or PATTERN-HIGHLIGHT. This mode may be useful
          for editing alignment of protein files; one could arrange
          it so that ``Isoleucine'' and ``Leucine'' would appear the
          same on the screen.

          The format of this file is extremely simple. Each line
          contains one ``record''. The first character of the line
          defines the character that will be displayed. Subsequent
          characters on that line are the sequence characters that
          should be displayed as the first one. Thus, if one of the
          lines was:

               iIL

          then both ``I'' (Isoleucine) and ``L'' (Leucine) would be
          represented on the screen by an ``i''.

          One may include blank lines. Lines that have a character in
          the first column will be ignored. This character was chosen
          since it is one of the few things that could never be
          displayed, since it takes more than one column position to
          output. This is done to permit one to include comment
          lines. Note that a character may be used; one may desire to
          hide certain bases altogether.
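
          The record format above can be sketched as a small lookup
          table in Python. This is an illustration under stated
          assumptions, not MASE's own code: the names
          parse_display_set and render are hypothetical, the comment
          character is passed in by the caller because this text does
          not show which character MASE uses, and passing unmapped
          characters through unchanged is likewise an assumption.

```python
def parse_display_set(lines, comment_char):
    """Map each sequence character to the character shown on screen.

    `comment_char` is a hypothetical parameter; supply whichever
    character marks comment lines in the actual file.
    """
    display = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:                    # blank lines are allowed
            continue
        if line[0] == comment_char:     # comment line: ignore
            continue
        shown = line[0]                 # first character: what is displayed
        for seq_char in line[1:]:       # the rest: characters it stands for
            display[seq_char] = shown
    return display


def render(sequence, display):
    """Alias a sequence for display only (assumption: unmapped
    characters are shown as themselves)."""
    return "".join(display.get(c, c) for c in sequence)


# "#" is only a placeholder comment character for this example.
table = parse_display_set(["iIL", "", "# aliphatic residues"], comment_char="#")
print(render("MILV", table))   # MiiV
```

          Because the table is applied only when rendering, the
          stored sequence is left untouched, which matches the
          statement that the alias set affects only the display.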

          See also: CHARACTER-SET, CONSENSUS, and
          HIGHLIGHT-DIFFERENCES

     OUTPUT MAPPING STRINGS
          The format used to define string mappings for output is
          somewhat awkward to use, mainly because of the complexity
          of the escape sequences required to generate different
          colors or attributes for output devices.

          Lines beginning with a character are ignored, and may be
          used for comment lines. Lines containing no text are
          properly ignored.

          The character in column one defines which character is to
          be mapped on output. The remainder of the line, columns two
          through the end of the line, contains the string that
          should be printed for the letter read in column one. This
          output string will be processed by STRING CONVERSION (see
          discussion in the INTRODUCTION). For a special case, the
          string mapped to is used as a ``reset'' string, used to
          turn off all attributes (as for the ends of lines . . .).

          There are two different cases of OUTPUT, mapping by letters
          and mapping by patterns. When mapping by letters occurs,
          the mapping string for each character encountered for
          output will be sent to the file instead of the character
          itself. When mapping by patterns occurs, OUTPUT attempts to
          mimic the screen display when PATTERN-HIGHLIGHT has been
          used. The string specified for ``1'' will be sent to enable
          ``reverse video'', and the string specified for ``0'' will
          be used to set ``normal video''.

          See also: OUTPUT in INTRINSIC FUNCTIONS, STRING CONVERSION
          in INTRODUCTION, and the discussion of OUTPUT in the
          TUTORIAL

EXAMPLES

FILES

SEE ALSO

BUGS

Printed 11/1/88                      DFCI