MeV File Loader - RNASeq File Format

RNASeq File Loader - Overview

The Count and RPKM file format is a tab-delimited format for loading HTS (High Throughput Sequencing) data into MeV for analysis. RNASeq analysis algorithms are currently provided only for these data sets.

Because MeV is desktop software, it cannot provide functionality such as base calling and sequence alignment, which are computationally intensive processes and should be done on high-end server clusters.

HTS data should enter MeV in a summarized form. This means that the raw sequence data has been base-called and aligned, and that tag counts have been assembled, summarized and mapped to a reference genome at the transcript/gene level.

This loader supports Counts, RPKM/FPKM, and combinations of both, as described in the Data Type section below. Currently only Human and Mouse data are supported, with annotation provided from RefSeq or Ensembl.

Both Count and Expression (RPKM/FPKM) values are maintained by this loader. The user has the option of loading either or both kinds of information for their data set. If either Count or RPKM is left out, it is calculated based on the publication described in the section Count to RPKM and vice versa.

Dialog Selections

Upload User Annotation []:

Currently we provide annotation support for Human and Mouse data only; support for more organisms will be added in the future. However, the user can provide annotation for any organism as long as they follow the annotation file format specification. By selecting this option the user agrees to upload an annotation file that matches the example file provided in data/rnaseq/ref_gene_h19_sample_anno.txt. The matching data file is in the same location, data/rnaseq/TagSeqExample.txt. The table below describes the header information for an annotation file:

The 15 annotation columns are described here. The columns should appear in the file in the same order as described here:

##	Field Name	Description
1	PROBE_ID	Required. A unique ID that identifies each row in the data file. It is treated like the probe_id of microarray data. To be associated with the data being loaded, the nearest_ref_id field in the data file should correspond to this field in the annotation file. This is generally expected to be a RefSeq or Ensembl ID, but the user can use anything as long as the values correspond in the data and the annotation file.
2	CHR	Required, but an empty value is OK. Stands for chromosome; when provided, the format should be: chr1
3	STRAND	Required. When not available for a row, a '-' or '' can be used.
4	TX_START	Required, but an empty value is OK. When provided, should indicate the BP position.
5	TX_END	Required, but an empty value is OK. When provided, should indicate the BP position.
6	CDS_START	Required, but an empty value is OK. Stands for coding sequence start position; when provided, should indicate the BP position.
7	CDS_END	Required, but an empty value is OK. Stands for coding sequence end position; when provided, should indicate the BP position.
8	exonCount	Required, but an empty value is OK. Stands for the count of exons in the coding sequence region; when provided, should be an integer value. Currently unused.
9	exonStarts	Required, but an empty value is OK. Stands for the start BP position of each exon; when provided, should be integer value(s) separated by commas, and the number of entries should match the exonCount. The format should be 29284557,29293459, (note the trailing comma). Currently unused.
10	exonEnds	Required, but an empty value is OK. Stands for the end BP position of each exon; when provided, should be integer value(s) separated by commas, and the number of entries should match the exonCount. The format should be 29284557,29293459, (note the trailing comma). Currently unused.
11	GENE_SYMBOL	Required, but an empty value is OK. Stands for gene symbol.
12	GENE_TITLE	Required, but an empty value is OK. Stands for gene title/description.
13	REFSEQ_ACC	Required, but an empty value is OK. Stands for an accession number such as a RefSeq ID or Ensembl ID.
14	PROTEIN_ACC	Required, but an empty value is OK. Stands for the protein accession number based on the RefSeq ID or Ensembl ID.
15	ENTREZ_ID	Required, but an empty value is OK. Stands for Gene ID.

Annotation File Required Characteristics

Tab Delimited
PROBE_ID in the annotation file is expected to match nearest_ref_id in the data file.
All columns are required, in the order specified, but blank/empty values are accepted in most of them.
Some columns are unused at this time, and empty values are OK for those. The table above lists the unused columns.
Data rows without a matching annotation row will still be loaded.
If multiple annotation rows match one data row, the first one is used.

Annotation File Example:

PROBE_ID	CHR	STRAND	TX_START	TX_END	CDS_START	CDS_END	exonCount	exonStarts	exonEnds	GENE_SYMBOL	GENE_TITLE	REFSEQ_ACC	PROTEIN_ACC	ENTREZ_ID
NR_024227	chr19	-	50595745	50595866						NAR-A6S	Some gene	NR_024227		100169957
NM_024328	chr14	+	24025197	24028786	24025966	24028049	2	24025197,24027903,	24026513,24028786,	THTPA	thiamine-triphosphatase	NM_024328	NP_077304	79178
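
The matching rules listed above (15 columns in fixed order, first matching row wins, unmatched data rows still load) can be sketched in Python. This is only an illustration and not MeV's own loader code; the names read_annotation and annotate are hypothetical, and the strict header check is an assumption.

```python
import csv
import io

# The 15 annotation columns, in the order the format requires.
ANNOTATION_COLUMNS = [
    "PROBE_ID", "CHR", "STRAND", "TX_START", "TX_END", "CDS_START", "CDS_END",
    "exonCount", "exonStarts", "exonEnds", "GENE_SYMBOL", "GENE_TITLE",
    "REFSEQ_ACC", "PROTEIN_ACC", "ENTREZ_ID",
]


def read_annotation(handle):
    """Index annotation rows by PROBE_ID; the first row seen for an ID wins."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != ANNOTATION_COLUMNS:  # assumption: header must match exactly
        raise ValueError("annotation header does not match the 15 expected columns")
    index = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so that trailing empty fields are still addressable.
        row = row + [""] * (len(ANNOTATION_COLUMNS) - len(row))
        record = dict(zip(ANNOTATION_COLUMNS, row))
        # setdefault keeps the first record when an ID appears more than once.
        index.setdefault(record["PROBE_ID"], record)
    return index


def annotate(nearest_ref_id, index):
    """Return the annotation row for a data row, or None if nothing matches."""
    return index.get(nearest_ref_id)
```

A data row whose nearest_ref_id is '-' or is otherwise absent from the index simply gets None here, which mirrors the rule that unmatched data rows are still loaded.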

Data Type:

The following data formats and combinations can be loaded. Each type specifies a file format and the kind of data the program loads. All 4 formats start with the same 5 annotation columns. The file formats differ only in the data column(s) and data types (integer or float).

The 5 annotation columns are described here. The columns should appear in the file in the same order as described here:

##	Field Name	Description
1	tracking_id	Required. A unique ID that identifies each row in the data file.
2	locus	Required for each gene/row in the data file. The format should be: chr1:7838183-7838231
3	nearest_ref_id	Required. When not available for a row, a '-' can be used. This field is used as a key accession to link into the RefSeq or Ensembl databases for known genomic regions.
4	class_code	Required, but an empty value is OK. This column can be used for any kind of notes/status about the gene.
5	transcript_length	Required, but an empty value is OK. When not provided, the difference between the start and end BP positions in the locus is used as the length.
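
As a rough illustration of the locus format and the transcript_length fallback described in the table, here is a short Python sketch. The function names are hypothetical, and computing the fallback as end minus start (with no +1 adjustment) is an assumption based on the wording "difference between the start and end BP"; MeV's exact arithmetic is not documented here.

```python
import re

# Matches loci such as chr1:7838183-7838231
LOCUS_PATTERN = re.compile(r"^(chr[^:]+):(\d+)-(\d+)$")


def parse_locus(locus):
    """Split a locus string into (chromosome, start, end)."""
    match = LOCUS_PATTERN.match(locus.strip())
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def transcript_length(length_field, locus):
    """Use the supplied length if present, otherwise derive it from the locus."""
    if length_field.strip():
        return int(length_field)
    _, start, end = parse_locus(locus)
    return end - start  # assumption: plain difference, no +1 adjustment
```

For the first row of the example data files, transcript_length("", "chr1:1431363-1431403") falls back to the locus and yields 40.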

RPKM

Normalized expression data in Reads Per Kilobase per Million (RPKM) units. The corresponding Count information is calculated by the program. However, the user is required to provide a 'Library Size file', which should list the size of each sample library. An example:

tracking_id	locus	nearest_ref_id	class_code	transcript_length	Sample_1	Sample_2	Sample_n
Gene_00002	chr1:1431363-1431403	NM_031921	c		161.726	20.44	81.2435
Gene_00003	chr1:2495127-2495222	NM_003820	c		1786.3	285.454	482.786
Gene_00004	chr1:5446956-5447187	-	-		141.803	47.344	107.779

Count

Discrete counts of sequence reads aligned in a genomic region. The program calculates the corresponding RPKM value for each observation in each sample. A 'Library Size file' is optional. If one is not provided, the sum of Counts in each sample is used as the library size. An example:

tracking_id	locus	nearest_ref_id	class_code	transcript_length	Sample_1	Sample_2	Sample_n
Gene_00002	chr1:1431363-1431403	NM_031921	c		20	81	9
Gene_00003	chr1:2495127-2495222	NM_003820	c		285	482	69
Gene_00004	chr1:5446956-5447187	-	-		47	107	32
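
When no 'Library Size file' is given for Count data, the library size of a sample is the sum of its counts. A minimal Python sketch of that default follows; the function name is hypothetical, and the rows are assumed to be already split on tabs with the 5 annotation columns removed.

```python
def default_library_sizes(sample_names, count_rows):
    """Sum the counts of each sample column to get its default library size."""
    totals = {name: 0 for name in sample_names}
    for row in count_rows:
        for name, value in zip(sample_names, row):
            totals[name] += int(value)
    return totals


# The three data rows from the Count example above (sample columns only).
sizes = default_library_sizes(
    ["Sample_1", "Sample_2", "Sample_n"],
    [["20", "81", "9"], ["285", "482", "69"], ["47", "107", "32"]],
)
# sizes == {"Sample_1": 352, "Sample_2": 670, "Sample_n": 110}
```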

RPKM & Count

Both RPKM and discrete counts of sequence reads are provided by the user. This format is useful when the user wants to provide both values using custom methods without relying on the system. There are 2 strict requirements:

The sample names of the 2 columns should be the same.
Within each pair, the first column is treated as RPKM and the second one as Count.

A 'Library Size file' is optional. An example:

tracking_id	locus	nearest_ref_id	class_code	transcript_length	Sample_1	Sample_1	Sample_2	Sample_2	Sample_n	Sample_n
Gene_00002	chr1:1431363-1431403	NM_031921	c		2323.12	20	323.12	81	223.12	9
Gene_00003	chr1:2495127-2495222	NM_003820	c		453.12	285	879.12	482	223.12	69
Gene_00004	chr1:5446956-5447187	-	-		443.12	47	2323.12	107	623.12	32
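
The two requirements for paired columns can be checked mechanically. The Python sketch below splits one data row into expression and count values; the function name and error messages are hypothetical, and it assumes the row has already been split on tabs.

```python
def split_paired_columns(header, row, n_annotation=5):
    """Split paired sample columns into (expression, count) dicts keyed by sample name."""
    names = header[n_annotation:]
    values = row[n_annotation:]
    if len(names) % 2 != 0 or len(values) != len(names):
        raise ValueError("sample columns must come in complete pairs")
    expression, count = {}, {}
    for i in range(0, len(names), 2):
        if names[i] != names[i + 1]:
            raise ValueError(f"pair {names[i]!r}/{names[i + 1]!r} has different sample names")
        expression[names[i]] = float(values[i])   # first column of the pair
        count[names[i]] = int(values[i + 1])      # second column of the pair
    return expression, count


header = ["tracking_id", "locus", "nearest_ref_id", "class_code", "transcript_length",
          "Sample_1", "Sample_1", "Sample_2", "Sample_2", "Sample_n", "Sample_n"]
row = ["Gene_00002", "chr1:1431363-1431403", "NM_031921", "c", "",
       "2323.12", "20", "323.12", "81", "223.12", "9"]
expression, count = split_paired_columns(header, row)
# count == {"Sample_1": 20, "Sample_2": 81, "Sample_n": 9}
```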

FPKM & Count

This format uses another common unit of HTS expression data, FPKM (Fragments Per Kilobase per Million). The program currently does not support automatic conversion from FPKM to Counts. As a result, a user who wants to load this kind of data has to provide the corresponding Count values as well. There are 2 strict requirements:

The sample names of the 2 columns should be the same.
Within each pair, the first column is treated as FPKM and the second one as Count.

A 'Library Size file' is optional. An example:

tracking_id	locus	nearest_ref_id	class_code	transcript_length	Sample_1	Sample_1	Sample_2	Sample_2	Sample_n	Sample_n
Gene_00002	chr1:1431363-1431403	NM_031921	c		2323.12	20	323.12	81	223.12	9
Gene_00003	chr1:2495127-2495222	NM_003820	c		453.12	285	879.12	482	223.12	69
Gene_00004	chr1:5446956-5447187	-	-		443.12	47	2323.12	107	623.12	32

Species: Currently we provide annotation support for Human and Mouse data only. Other requirements on Ref Genome type and build version apply.
1. Human
2. Mouse
Reference Genome: The reference genome that was used to map/align the reads. The same reference genome is used for the other annotation. RefSeq and Ensembl models are currently supported. This selection is important because the 'nearest_ref_id' field is used to link to the annotation DB, and an incorrect selection would lead to undesirable results.
1. RefSeq
2. ENSMBL
UCSC Build: This field specifies the reference genome version used to map/align the reads. It is important to specify this correctly, as the mapping information changes enough between versions to lead to mapping mistakes.
1. Human: hg19 & hg18
2. Mouse: mm9 & mm8
Read Length: Optional field to specify the length of sequence reads in the experiment. For future use.

Library Size File

This should be a tab-delimited file without a header. Each row should have 2 columns: Sample name and Library Size. Comment lines are OK and should start with "#". An example:

# This is a comment
Sample_1	5454545
Sample_2	694545
Sample_n	3245443
# This is the End
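
A file in this layout is easy to read programmatically. The Python sketch below is an illustration only (the function name is hypothetical); it assumes exactly 2 tab-separated columns per data row and skips blank lines as well as "#" comments.

```python
def read_library_sizes(lines):
    """Parse 'Sample<TAB>Size' rows, skipping blank lines and '#' comments."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        name, size = line.split("\t")  # raises ValueError unless exactly 2 columns
        sizes[name] = int(size)
    return sizes


example = [
    "# This is a comment\n",
    "Sample_1\t5454545\n",
    "Sample_2\t694545\n",
    "Sample_n\t3245443\n",
    "# This is the End\n",
]
library_sizes = read_library_sizes(example)
# library_sizes == {"Sample_1": 5454545, "Sample_2": 694545, "Sample_n": 3245443}
```

Passing an open file object instead of a list works the same way, since both yield one line per iteration.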

Count to RPKM and vice versa

When either RPKM or Count information is provided, MeV calculates the other based on the publication by Mortazavi et al., Nature Methods 5, 621-628 (2008). The supplementary section of that paper describes the approach in detail. Here is the basic formula used: RPKM = Count / Library Size / TranscriptLength * 1e+9
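
The formula and its inverse can be written as two small Python functions. This is a sketch of the arithmetic only: the function names are hypothetical, and whether MeV rounds the back-calculated Count to an integer is not stated here, so the inverse returns a float.

```python
def count_to_rpkm(count, library_size, transcript_length):
    """RPKM = Count / Library Size / TranscriptLength * 1e+9"""
    return count / library_size / transcript_length * 1e9


def rpkm_to_count(rpkm, library_size, transcript_length):
    """Inverse of the formula above; returns a float (no rounding applied)."""
    return rpkm * library_size * transcript_length / 1e9


# 20 reads on a 1000 bp transcript in a library of 1,000,000 reads
# gives an RPKM of about 20.
rpkm = count_to_rpkm(20, 1_000_000, 1000)
```

The factor 1e+9 combines the per-kilobase scaling (1e+3) with the per-million-reads scaling (1e+6).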

Rules and requirements

When RPKM is provided, a Library Size file is required. When Count is provided, the file is optional and MeV takes the sum of the counts of each sample as the library size.
When the transcript_length annotation column is left empty in the data file, MeV calculates the length from the locus as the difference between the start and end BP positions.