EMBL dumping documentation

EMBL dump now gets information from a Method object, which controls what is dumped where. In fact, you can specify dump field information in the Sequence object itself, in a method, or in a method inherited from that method, and so on. That is, you can have defaults that you override either in a more specialised method or in the object itself.

You can now dump any sequence object for which DNA can be found, directly or indirectly (so you can dump links or subsequences).

You can also now dump from giface, including gifaceserver (see example below).

There is now code for dumping arbitrary features according to dump information in the feature method, plus possible #Feature_info data from the Feature lines. This also works on Homol lines now.

Known bugs/incompletenesses:

I don't yet get arbitrary features from subsequences, nor recurse on them. So in fact dumping links won't work, I realise.
I should use methods on subsequences when dumping them (partially done now; 980815).

Using EMBL_dump_info and associated methods

The order of processing the header information is:

The object needs to be a Sequence object, and DNA needs to be found for it.
Complains if no object under _Clone [clone]. This will be used in the context of Clone_left_end information (see below).
The ID entry is derived from _Database "EMBL" [id] [ac]. If it exists then use [id]. Else use ID_template from the dump_info, replacing %s by the sequence object name.
AC lines are calculated from _Database "EMBL" [id] [ac]. If that does not exist, look under _Ac_number [ac]. Else nothing.
DE lines come from DE_format in dump_info. %s is substituted by the clone name if one exists, else the sequence name.
KW lines come from _Keyword [keyword] entries.
OS line from OS_line in method.
OC lines from OC_line in method. These are not rewrapped, so you have to get the lines right.
References. The submission reference gets RA names from From_author in the object, then Previous_author in the object; both can be multiple. The RL line is filled from RL_submission, replacing DD-MMM-YYYY with the date obtained from Submitted in the object if that exists, else the current date. Further references are then taken from EMBL_reference in the method, parsing fields out of the standard Paper model.
Gets CC lines from CC_line in the method. Replace %s in these by clone name, or object name if that does not exist. Each CC_line text will be wrapped and appear as a separate paragraph separated by a blank XX line from the next paragraph.
Writes standard CC lines. The first set of these gives left and right end information using Clone_left_end and Clone_right_end info from the clone for this object, and for other clones with ends in this sequence. The second set gives overlap information with neighbouring sequences.
Finally, write CC lines for explicit DB_remark entries from the object itself. These are also wrapped and separated into paragraphs by XX lines.
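
The template substitutions in the header steps above can be sketched as follows. This is only an illustration in Python, not the ACeDB code: the function names, the 80-column wrap width and the upper-case month abbreviations are assumptions, and the object names in the usage note are made up.

```python
import textwrap
from datetime import date

# Month abbreviations written out explicitly so the result does not depend on locale.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def id_name(embl_id, id_template, seq_name):
    """Use the [id] under Database "EMBL" if present, else fill in ID_template."""
    if embl_id:
        return embl_id
    return id_template.replace("%s", seq_name)

def de_text(de_format, clone_name, seq_name):
    """Substitute %s in DE_format by the clone name, else the sequence name."""
    return de_format.replace("%s", clone_name or seq_name)

def rl_text(rl_submission, submitted=None):
    """Replace DD-MMM-YYYY by the Submitted date, else by the current date."""
    d = submitted or date.today()
    stamp = "%02d-%s-%04d" % (d.day, MONTHS[d.month - 1], d.year)
    return rl_submission.replace("DD-MMM-YYYY", stamp)

def cc_lines(cc_texts, clone_name, seq_name, width=80):
    """Wrap each CC_line text as its own paragraph, with XX lines in between."""
    lines = []
    for i, text in enumerate(cc_texts):
        if i > 0:
            lines.append("XX")
        filled = text.replace("%s", clone_name or seq_name)
        lines.extend(textwrap.wrap(filled, width=width,
                                   initial_indent="CC   ",
                                   subsequent_indent="CC   "))
    return lines
```

For example, with a hypothetical sequence ZK637 that has no Database "EMBL" entry, id_name(None, "CEL%s", "ZK637") gives "CELZK637".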

The feature table is written in the following order:

A source feature is written. This uses source_organism from the method to fill in /organism. The clone name is given in /clone if that exists. The chromosome and map qualifiers are filled from EMBL_chromosome and EMBL_map fields in the #EMBL_dump_info of the sequence, if they are there. If not, and there is a Map object for the sequence, then it looks inside that for a Text entry following the tag EMBL_chromosome, and if that is found, writes a /chromosome entry containing it.
Next it does the subsequences:
1. Prior to 980915 it looks for tags CDS, mRNA, and text following tRNA, snRNA, scRNA, misc_RNA and dumps accordingly. For the latter ones with text it produces a note containing the text, followed by "-tRNA" or "-RNA".
2. From 980915, in order to be dumped, a subsequence must have a Method, and that method must have an EMBL_feature, which specifies the feature key used. For pseudogenes use the key "CDS" (required by EMBL rules). The RNA notes are filled in as before.
3. For a CDS, checks whether it is translatable without a stop codon, and if not then sets the /pseudo qualifier and issues a messerror.
4. Builds the location information, going across LINK boundaries. Checks for Start_not_found and End_not_found tags and uses greater than and less than signs in these cases (but not /partial). In these cases, it sets codon_start where necessary.
5. If CDS_predicted_by a method, then dumps out /note="predicted using %s".
6. If the name of the object ends in a lower case letter, then prints out /note="preliminary prediction". EBI recognises this, and puts the resulting peptide into REM-TREMBL, not SP-TREMBL.
7. If a Locus is attached then writes /gene="{Locus}". If gene_from_name was set in the model, then it will also write out /gene="{object name}".
8. If a DB_remark is attached to the object, then dumps that as a note. Else if there is a Brief_identification, dumps /note="similar to {brief_id}".
9. For each sequence (normally an EST) attached via Matching_cDNA to the object, prints a line saying "cDNA EST EMBL:%s comes from this gene" or "cDNA EST %s comes from this gene" depending on whether an accession number can be found. These go 40 lines per /note qualifier.
10. If it finds TSL_site followed by an integer, then writes a line /note="Possible trans-spliced leader site at {position}". Very worm specific.
Next, Features and Homols are dumped as described in the text below, if their methods have EMBL_dump information.
Finally, the sequence itself is dumped as an SQ line followed by sequence.
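
Some of the subsequence steps above (location building, the "preliminary prediction" test and the cDNA notes) can be sketched like this. It is an illustrative Python sketch, not the ACeDB implementation. The function names are hypothetical, only forward-strand exons are handled, and splitting the cDNA lines into groups of 40 is one reading of "40 lines per /note qualifier".

```python
def embl_location(exons, start_not_found=False, end_not_found=False):
    """Build an EMBL location string from (start, end) exon pairs.

    Start_not_found puts "<" on the first start, End_not_found puts ">"
    on the last end. Forward strand only (an assumption of this sketch).
    """
    parts = []
    last = len(exons) - 1
    for i, (start, end) in enumerate(exons):
        left = ("<" if start_not_found and i == 0 else "") + str(start)
        right = (">" if end_not_found and i == last else "") + str(end)
        parts.append(left + ".." + right)
    if len(parts) == 1:
        return parts[0]
    return "join(" + ",".join(parts) + ")"

def is_preliminary(name):
    """True if the object name ends in a lower case letter."""
    return name[-1:].islower()

def cdna_notes(ests, per_note=40):
    """ests is a list of (name, accession or None); returns one list of lines per /note."""
    lines = []
    for name, accession in ests:
        if accession:
            lines.append("cDNA EST EMBL:%s comes from this gene" % accession)
        else:
            lines.append("cDNA EST %s comes from this gene" % name)
    return [lines[i:i + per_note] for i in range(0, len(lines), per_note)]
```

With two exons and both Start_not_found and End_not_found set, embl_location([(1, 100), (200, 300)], True, True) gives "join(<1..100,200..>300)".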

Model Changes

Create a new # (subobject) model:

#EMBL_dump_info	EMBL_dump_method UNIQUE ?Method
		ID_template UNIQUE Text	
		ID_division UNIQUE Text
		DE_format UNIQUE Text  
		OS_line UNIQUE Text    
		OC_line Text           
		RL_submission Text     
		EMBL_reference ?Paper  
		CC_line Text           
		source_organism UNIQUE Text   
		gene_from_name
		EMBL_chromosome UNIQUE Text
		EMBL_map UNIQUE Text

This information is made accessible by adding to the Method model:

	EMBL_dump_info #EMBL_dump_info

Add similarly to Sequence model:

	  DB_info	...
			EMBL_dump_info #EMBL_dump_info

The use of the shared subobject model makes things recursive. When looking for OS_line, for example, the first one that is found gets used, starting with information in the Sequence object, then in its EMBL_dump_method, then in that method's own EMBL_dump_method, and so on.
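
The lookup order can be sketched as below. This is a minimal Python illustration, not the ACeDB code: the dictionaries are hypothetical stand-ins for #EMBL_dump_info subobjects, the method names and field values are made up, and the guard against a method chain that loops back on itself is an addition of this sketch.

```python
# Hypothetical stand-ins: each dict holds the fields of one #EMBL_dump_info,
# and "EMBL_dump_method" names the next Method in the chain.
METHODS = {
    "worm_EMBL": {"OS_line": "Caenorhabditis elegans (nematode)",
                  "ID_division": "INV"},
    "cosmid_EMBL": {"EMBL_dump_method": "worm_EMBL",
                    "ID_template": "CE%s"},
}

def find_dump_field(dump_info, field, methods):
    """Return the first value of field found along the EMBL_dump_method chain, else None."""
    seen = set()
    while dump_info is not None:
        if field in dump_info:
            return dump_info[field]
        next_name = dump_info.get("EMBL_dump_method")
        if next_name is None or next_name in seen:
            return None  # end of chain, or a loop in the chain
        seen.add(next_name)
        dump_info = methods.get(next_name)
    return None

# The Sequence object's own #EMBL_dump_info overrides ID_template only.
sequence_info = {"EMBL_dump_method": "cosmid_EMBL", "ID_template": "CEL%s"}
```

Here ID_template is taken from the Sequence object itself, while OS_line is only found two methods further along the chain.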

Add to Map model:

	EMBL_chromosome UNIQUE Text

This determines how a Map object is transformed to a /chromosome="xx" line under the "source" feature key.
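
The fallback from the sequence's #EMBL_dump_info to the Map object can be sketched as follows. This is a Python illustration only. The function name and the dictionary representation are hypothetical, and the sketch looks only at the sequence's own dump info, since whether EMBL_chromosome is also searched for along the method chain is not stated here.

```python
def chromosome_qualifier(dump_info, map_obj=None):
    """Return the /chromosome line for the source feature, or None if nothing is found.

    dump_info and map_obj are hypothetical dicts standing in for the
    sequence's #EMBL_dump_info and its Map object.
    """
    value = dump_info.get("EMBL_chromosome")
    if value is None and map_obj is not None:
        value = map_obj.get("EMBL_chromosome")
    if value is None:
        return None
    return '/chromosome="%s"' % value
```

So a sequence with no EMBL_chromosome of its own, on a Map with EMBL_chromosome "III", would get /chromosome="III".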

Models for dumping Features and Homols

?Sequence ...
//	  Homol

Richard Durbin $Id: embl.doc.html,v 1.6 1999/01/25 16:53:25 edgrif Exp $