Description of Annotation Files
An annotation file is a tab-delimited text file containing annotation data for a specific slide_type. A mev file can be associated with an annotation file only if both files are based on the same slide_type. The unique ids in both files are the keys to this association: a row of a mev file and a row of an annotation file can be associated with each other if their unique ids are identical. A single header row must precede the annotation data to identify the columns below it. Each remaining row of the file stores annotation data for a particular spot/feature on the array.
Annotation files may contain any number of non-computational comment lines. These lines, starting with '#', will be treated identically to comment lines in mev files, and should precede the header row.
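The rules above can be sketched as a minimal reader. This is an illustrative sketch only, not a TIGR tool: the function name read_annotation, the sample data, and the mev column "IA" are assumptions made for the example, and the mev rows are assumed to be already loaded and keyed by unique id.

```python
import io

def read_annotation(stream):
    """Return (header, rows_by_uid) from a tab-delimited annotation stream."""
    header = None
    rows_by_uid = {}
    for line in stream:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#' comment lines (skipped wherever they occur).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # the first non-comment line is the header row
        else:
            # Key each data row by its first field, the unique id.
            rows_by_uid[fields[0]] = dict(zip(header, fields))
    return header, rows_by_uid

# Hypothetical sample content, for illustration only.
sample = (
    "# version: V3.0\n"
    "UID\tR\tC\tFeatN\n"
    "db:1\t1\t1\tfeatA\n"
    "db:2\t1\t2\tfeatB\n"
)
header, annotation = read_annotation(io.StringIO(sample))

# Hypothetical mev rows, assumed to be keyed by the same unique ids.
mev_rows = {"db:2": {"IA": "1200"}, "db:3": {"IA": "800"}}

# Associate rows whose unique ids are identical in both files.
matched = {
    uid: {**annotation[uid], **mev_rows[uid]}
    for uid in mev_rows
    if uid in annotation
}
```

Here only db:2 appears in both inputs, so it is the only row in matched.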
Annotation files created at TIGR will use UIDs that match the format used in the mev files, most likely database_name:spot_id. The structure of each annotation file is detailed below. Annotation files created at TIGR will typically contain at least one comment at the top of the file with the following information:
version | Version number based on revisions of annotation data |
format_version | The version of the .mev file format document |
date | Date of file creation or update |
analyst | Owner or the person responsible for creating the file |
created_by | Software tool used to create the document |
gi_version | Version of the Gene Indices (or db?) that produced this annotation data |
slide_type | Type from the slide_type table that this array is based on |
output_row_count | Number of rows of annotation (i.e. non-header) data |
description | Common name or other details about the experiment |
An example of the leading comments:
# version: V3.0
# format_version: V4.0
# date: 04/20/2004
# analyst: jwhite
# created_by: Database script
# gi_version: 3.0
# slide_type: IASCAG1
# output_row_count: 32448
# description: Standard annotation file
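As a rough sketch, leading comments of this "key: value" form could be collected into a dictionary as follows. The function name parse_leading_comments is hypothetical, and the sketch assumes each comment line holds one key followed by a colon, as in the example above.

```python
def parse_leading_comments(lines):
    """Collect '# key: value' comment lines that precede the header row."""
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break  # comments precede the header row, so stop at the first other line
        # Split on the first ':' only, so values may themselves contain ':'.
        key, sep, value = line[1:].partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata

example = [
    "# version: V3.0",
    "# date: 04/20/2004",
    "# slide_type: IASCAG1",
    "# output_row_count: 32448",
    "UID\tR\tC",
]
meta = parse_leading_comments(example)
```

A reader could then compare int(meta["output_row_count"]) with the number of data rows actually read, as a simple consistency check.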
The header row consists of the field names for each subsequent row in this file. Only the UID field is required. It must be the first field present and it must be named 'UID'. Any number of additional fields may be included. Annotation files created at TIGR will always contain the following columns:
UID | unique identifier for this line of annotation |
R | row (slide row) |
C | column (slide column) |
The remaining fields may vary, and a standard set has yet to be determined. Such a list will be published at a future date. R and C have been included to allow for manual alignment of the mev and corresponding annotation files in the event that the mev files were not generated in a traditional manner (i.e. using Madam, etc.).
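The header requirement and the R/C alignment idea might be sketched as below. The helper names check_header and index_by_position are hypothetical, and the rows are assumed to be lists of strings already split on tabs.

```python
def check_header(header):
    """Return a list of problems found in an annotation header row."""
    problems = []
    # The only hard requirement: UID is present, first, and named 'UID'.
    if not header or header[0] != "UID":
        problems.append("first field must be named 'UID'")
    return problems

def index_by_position(header, rows):
    """Index data rows by (slide row, slide column) for manual alignment."""
    r = header.index("R")  # raises ValueError if the file has no R column
    c = header.index("C")
    return {(int(row[r]), int(row[c])): row for row in rows}

# Hypothetical data, for illustration only.
header = ["UID", "R", "C", "GeneN"]
rows = [
    ["db:1", "1", "1", "geneA"],
    ["db:2", "1", "2", "geneB"],
]
problems = check_header(header)
by_position = index_by_position(header, rows)
```

Looking up by_position[(1, 2)] then returns the annotation row for the spot at slide row 1, slide column 2, which can be paired with the mev row at the same position.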
Some varieties of annotation files follow. The format may vary depending on the purpose of the file:
UID \t R \t C \t FeatN \t GBNum \t TCNum \t ComN \t …
UID \t R \t C \t GeneN \t Rxn \t PathwayN \t …
UID \t R \t C \t FeatN \t End5 \t End3 \t ChrNum \t …
Of course, it would be possible to combine the fields of these files, or add fields that have not been mentioned here. The goal is to keep the annotation flexible and the processing seamless.
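One way to combine the fields of two annotation files is to merge their rows by UID. The sketch below assumes each file has already been loaded into a dictionary keyed by UID; the function name combine_annotations and the rule that the first file wins on conflicting fields are assumptions of this example, not part of the format.

```python
def combine_annotations(first, second):
    """Merge two UID-keyed annotation tables, field by field."""
    combined = {}
    for uid in first.keys() | second.keys():
        # Fields from `first` override same-named fields from `second`.
        combined[uid] = {**second.get(uid, {}), **first.get(uid, {})}
    return combined

# Hypothetical rows, for illustration only.
features = {"db:1": {"UID": "db:1", "R": "1", "C": "1", "FeatN": "featA"}}
pathways = {
    "db:1": {"UID": "db:1", "R": "1", "C": "1", "PathwayN": "pathA"},
    "db:2": {"UID": "db:2", "R": "1", "C": "2", "PathwayN": "pathB"},
}
merged = combine_annotations(features, pathways)
```

Rows present in only one file are kept with whatever fields that file provides.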
There are no naming conventions for annotation files at this time. If such a standard is introduced in the future, it will be detailed here.