Count and RPKM File Format is a tab-delimited file format for loading HTS (High Throughput Sequencing) data into MeV for analysis. Currently, only RNASeq analysis algorithms are provided for these data sets.
MeV is desktop software and cannot provide functionality such as base calling and sequence alignment, which are computationally intensive processes that should be done on high-end server clusters.
HTS data should therefore enter MeV in a summarized form. This means the raw sequence data has been base-called and aligned, and the tag counts have been assembled, summarized and mapped to a reference genome at the transcript/gene level.
This loader supports Counts, RPKM/FPKM, and combinations of both, as described in the Data Type section below. Currently only Human and Mouse data are supported, with annotation provided from RefSeq or Ensembl.
Both Count and Expression (RPKM/FPKM) values are maintained by this kind of loader. The user has the option of loading either or both kinds of information for their data set. If either Count or RPKM is left out, it is calculated based on the publication described in the section Count to RPKM and vice versa.
Currently we provide annotation support for Human and Mouse data only; support for more organisms will be added in the future. However, the user can provide annotation for any organism as long as they follow the annotation file format specifications. By selecting this option the user agrees to upload an annotation file that matches the example file provided in data/rnaseq/ref_gene_h19_sample_anno.txt. The matching data file is in the same location, data/rnaseq/TagSeqExample.txt. The table below describes the header information for an annotation file:
The 15 annotation columns are described here. The columns should appear in the file in the same order as described here:
## | Field Name | Description |
1 | PROBE_ID | Required. A unique ID that identifies each row in the data file. It is treated like the probe_id of microarray data. To be associated with the data being loaded, the nearest_ref_id field in the data file should correspond to this field in the annotation file. This is generally expected to be a RefSeq or Ensembl ID, but the user can use anything as long as the values correspond between the data file and the annotation file. |
2 | CHR | Required, but an empty value is ok. Stands for chromosome, and when provided the format should be: chr1 |
3 | STRAND | Required. When not available for a row, a '-' or '' can be used. |
4 | TX_START | Required, but an empty value is ok. When provided should indicate the BP position. |
5 | TX_END | Required, but an empty value is ok. When provided should indicate the BP position. |
6 | CDS_START | Required, but an empty value is ok. Stands for coding sequence start position and when provided should indicate the BP position. |
7 | CDS_END | Required, but an empty value is ok. Stands for coding sequence end position and when provided should indicate the BP position. |
8 | exonCount | Required, but an empty value is ok. Stands for count of exons in the coding sequence region, and when provided should be an integer value. Currently unused. |
9 | exonStarts | Required, but an empty value is ok. Stands for start BP position of each exon, and when provided should be integer value(s) separated by commas, and the number of entries should match the exonCount. The format should be 29284557,29293459,. Currently unused. |
10 | exonEnds | Required, but an empty value is ok. Stands for end BP position of each exon, and when provided should be integer value(s) separated by commas, and the number of entries should match the exonCount. The format should be 29284557,29293459,. Currently unused. |
11 | GENE_SYMBOL | Required, but an empty value is ok. Stands for gene symbol. |
12 | GENE_TITLE | Required, but an empty value is ok. Stands for gene title/description. |
13 | REFSEQ_ACC | Required, but an empty value is ok. Stands for accession number like RefSeq Id or Ensembl Id. |
14 | PROTEIN_ACC | Required, but an empty value is ok. Stands for protein accession number based on RefSeq Id or Ensembl Id. |
15 | ENTREZ_ID | Required, but an empty value is ok. Stands for Gene Id. |
Annotation File Required Characteristics
Annotation File E.g.:
PROBE_ID | CHR | STRAND | TX_START | TX_END | CDS_START | CDS_END | exonCount | exonStarts | exonEnds | GENE_SYMBOL | GENE_TITLE | REFSEQ_ACC | PROTEIN_ACC | ENTREZ_ID |
NR_024227 | chr19 | - | 50595745 | 50595866 | | | | | | NAR-A6S | Some gene | NR_024227 | | 100169957 |
NM_024328 | chr14 | + | 24025197 | 24028786 | 24025966 | 24028049 | 2 | 24025197,24027903, | 24026513,24028786, | THTPA | thiamine-triphosphatase | NM_024328 | NP_077304 | 79178 |
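The column-order requirement above can be checked before loading. The following is a minimal sketch, not MeV's own loader: it assumes the annotation file is tab-delimited with a single header row, as the example suggests, and the function name `read_annotation` is hypothetical.

```python
import csv
import io

# Column names and order as listed in the annotation table above.
ANNOTATION_COLUMNS = [
    "PROBE_ID", "CHR", "STRAND", "TX_START", "TX_END",
    "CDS_START", "CDS_END", "exonCount", "exonStarts", "exonEnds",
    "GENE_SYMBOL", "GENE_TITLE", "REFSEQ_ACC", "PROTEIN_ACC", "ENTREZ_ID",
]


def read_annotation(handle):
    """Hypothetical helper: return {PROBE_ID: row dict} from a tab-delimited file.

    Assumes (not confirmed by the MeV documentation) that the first line is
    a header row containing exactly the 15 column names, in order.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != ANNOTATION_COLUMNS:
        raise ValueError("header does not match the 15 expected columns")
    annotation = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(ANNOTATION_COLUMNS):
            raise ValueError(f"line {line_no}: expected 15 fields, found {len(row)}")
        record = dict(zip(ANNOTATION_COLUMNS, row))
        probe_id = record["PROBE_ID"]
        if not probe_id:
            raise ValueError(f"line {line_no}: PROBE_ID must not be empty")
        if probe_id in annotation:
            raise ValueError(f"line {line_no}: duplicate PROBE_ID {probe_id}")
        annotation[probe_id] = record
    return annotation


# Usage with the NM_024328 row from the example above.
sample = "\t".join(ANNOTATION_COLUMNS) + "\n" + "\t".join([
    "NM_024328", "chr14", "+", "24025197", "24028786", "24025966", "24028049",
    "2", "24025197,24027903,", "24026513,24028786,", "THTPA",
    "thiamine-triphosphatase", "NM_024328", "NP_077304", "79178",
]) + "\n"
annotation = read_annotation(io.StringIO(sample))
print(annotation["NM_024328"]["GENE_SYMBOL"])
```

Empty values are accepted for every column except PROBE_ID, matching the "Required, but an empty value is ok" wording in the table.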
We allow the following data formats and combinations to be loaded. Each type specifies a file format and the kind of data the program is loading. All 4 formats start with the same 5 annotation columns. The file formats differ only in the data column(s) and data types: integer or float.
The 5 annotation columns are described here. The columns should appear in the file in the same order as described here:
## | Field Name | Description |
1 | tracking_id | Required. A unique ID that identifies each row in the data file. |
2 | locus | Required for each gene/row in the data file. The format should be: chr1:7838183-7838231 |
3 | nearest_ref_id | Required. When not available for a row, a '-' can be used. This field is used as a key accession to link into RefSeq or Ensembl databases for known genomic regions. |
4 | class_code | Required, but an empty value is ok. This column can be used for any kind of notes/status about the gene. |
5 | transcript_length | Required, but an empty value is ok. When not provided, the difference between the start and end BP in the locus is used as the length. |
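The fallback described for transcript_length can be sketched as follows. This is an illustration only: the function names `parse_locus` and `effective_length` are hypothetical, and the length is taken as end minus start, following the wording "the difference between the start and end BP".

```python
import re

# Matches the documented locus format, e.g. chr1:7838183-7838231
LOCUS_PATTERN = re.compile(r"^(chr[^:]+):(\d+)-(\d+)$")


def parse_locus(locus):
    """Split a locus string into (chromosome, start, end)."""
    match = LOCUS_PATTERN.match(locus.strip())
    if match is None:
        raise ValueError(f"locus not in chrN:start-end format: {locus!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def effective_length(transcript_length, locus):
    """Return transcript_length when given, else end - start from the locus."""
    if transcript_length.strip():
        return float(transcript_length)
    _, start, end = parse_locus(locus)
    return float(end - start)


# An empty transcript_length falls back to the locus span.
print(effective_length("", "chr1:7838183-7838231"))
# A provided value is used as-is.
print(effective_length("1786", "chr1:2495127-2495222"))
```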
Normalized expression data in Reads per Kilobase per Million (RPKM) units. The corresponding Count information is calculated by the program. However, the user is required to provide a 'Library Size file', which should list the size of each sample library. An example:
tracking_id | locus | nearest_ref_id | class_code | transcript_length | Sample_1 | Sample_2 | Sample_n |
Gene_00002 | chr1:1431363-1431403 | NM_031921 | c | | 161.726 | 20.44 | 81.2435 |
Gene_00003 | chr1:2495127-2495222 | NM_003820 | c | | 1786.3 | 285.454 | 482.786 |
Gene_00004 | chr1:5446956-5447187 | - | - | | 141.803 | 47.344 | 107.779 |
Discrete counts of sequence reads aligned in a genomic region. The program calculates the corresponding RPKM value for each observation in each sample. A 'Library Size file' is optional. If one is not provided, the sum of Counts in each sample is used as the library size. An example:
tracking_id | locus | nearest_ref_id | class_code | transcript_length | Sample_1 | Sample_2 | Sample_n |
Gene_00002 | chr1:1431363-1431403 | NM_031921 | c | | 20 | 81 | 9 |
Gene_00003 | chr1:2495127-2495222 | NM_003820 | c | | 285 | 482 | 69 |
Gene_00004 | chr1:5446956-5447187 | - | - | | 47 | 107 | 32 |
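When no Library Size file is supplied, the default library size is the per-sample sum of Counts. A minimal sketch of that rule, using the count values from the example above (the function name `default_library_sizes` is hypothetical and not part of MeV):

```python
def default_library_sizes(sample_names, count_rows):
    """Sum the counts column by column, giving one library size per sample."""
    totals = [0] * len(sample_names)
    for row in count_rows:
        if len(row) != len(sample_names):
            raise ValueError("each row needs one count per sample")
        for index, count in enumerate(row):
            totals[index] += count
    return dict(zip(sample_names, totals))


# Count values for the three genes in the example above.
counts = [
    [20, 81, 9],
    [285, 482, 69],
    [47, 107, 32],
]
sizes = default_library_sizes(["Sample_1", "Sample_2", "Sample_n"], counts)
print(sizes)  # {'Sample_1': 352, 'Sample_2': 670, 'Sample_n': 110}
```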
Both RPKM and discrete counts of sequence reads are provided by the user. This format is useful when the user wants to provide those values using custom methods without relying on the system. There are 2 strict requirements:
A 'Library Size file' is optional. An example:
tracking_id | locus | nearest_ref_id | class_code | transcript_length | Sample_1 | Sample_1 | Sample_2 | Sample_2 | Sample_n | Sample_n |
Gene_00002 | chr1:1431363-1431403 | NM_031921 | c | | 2323.12 | 20 | 323.12 | 81 | 223.12 | 9 |
Gene_00003 | chr1:2495127-2495222 | NM_003820 | c | | 453.12 | 285 | 879.12 | 482 | 223.12 | 69 |
Gene_00004 | chr1:5446956-5447187 | - | - | | 443.12 | 47 | 2323.12 | 107 | 623.12 | 32 |
This uses another commonly used unit of HTS expression data called FPKM (Fragments per Kilobase per Million). Our program currently does not support automatic conversion from FPKM to Counts. As a result, a user who wants to load this kind of data has to provide the corresponding Count values as well. There are 2 strict requirements:
A 'Library Size file' is optional. An example:
tracking_id | locus | nearest_ref_id | class_code | transcript_length | Sample_1 | Sample_1 | Sample_2 | Sample_2 | Sample_n | Sample_n |
Gene_00002 | chr1:1431363-1431403 | NM_031921 | c | | 2323.12 | 20 | 323.12 | 81 | 223.12 | 9 |
Gene_00003 | chr1:2495127-2495222 | NM_003820 | c | | 453.12 | 285 | 879.12 | 482 | 223.12 | 69 |
Gene_00004 | chr1:5446956-5447187 | - | - | | 443.12 | 47 | 2323.12 | 107 | 623.12 | 32 |
This should be a tab-delimited file without a header. Each row should have 2 columns: Sample name and Library Size. Comment lines are OK and should start with "#". An example:
# This is a comment | |
Sample_1 | 5454545 |
Sample_2 | 694545 |
Sample_n | 3245443 |
# This is the End |
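A reader for this layout might look like the sketch below. It is not MeV's parser: the function name `read_library_sizes` is hypothetical, and treating library sizes as integers is an assumption based on the example values.

```python
import io


def read_library_sizes(handle):
    """Return {sample name: library size} from a tab-delimited, header-less file."""
    sizes = {}
    for line_no, raw in enumerate(handle, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are ignored
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected sample name and library size")
        name, size = fields
        sizes[name] = int(size)  # assumption: sizes are whole numbers
    return sizes


# The example file above, written with tab separators.
example = io.StringIO(
    "# This is a comment\n"
    "Sample_1\t5454545\n"
    "Sample_2\t694545\n"
    "Sample_n\t3245443\n"
    "# This is the End\n"
)
library_sizes = read_library_sizes(example)
print(library_sizes["Sample_2"])  # 694545
```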
When either RPKM or Count information is provided, MeV calculates the other based on the publication by Mortazavi et al., Nature Methods 5, 621-628 (2008). The supplementary section describes the approach in detail. Here is the basic formula used: RPKM = Count / Library Size / Transcript Length * 1e+9
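The formula and its inverse can be written as two small functions. This is a sketch of the stated formula only: the function names are hypothetical, and whether MeV rounds the back-calculated Count to an integer is not stated here, so no rounding is applied.

```python
import math


def count_to_rpkm(count, library_size, transcript_length):
    """RPKM = Count / Library Size / Transcript Length * 1e+9."""
    if library_size <= 0 or transcript_length <= 0:
        raise ValueError("library size and transcript length must be positive")
    return count / library_size / transcript_length * 1e9


def rpkm_to_count(rpkm, library_size, transcript_length):
    """Inverse of the formula: Count = RPKM * Library Size * Transcript Length / 1e+9."""
    if library_size <= 0 or transcript_length <= 0:
        raise ValueError("library size and transcript length must be positive")
    return rpkm * library_size * transcript_length / 1e9


# Illustrative numbers (not from the MeV example files): 100 reads on a
# 1000 bp transcript in a library of 1,000,000 reads gives about 100 RPKM.
rpkm = count_to_rpkm(100, 1_000_000, 1000)
print(round(rpkm, 6))
# Converting back recovers the original count up to floating-point error.
print(math.isclose(rpkm_to_count(rpkm, 1_000_000, 1000), 100))
```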
Rules and requirements