Sequence Input in EMBOSS ======================== Introduction ------------ The core concept in sequence input for EMBOSS is the AjPSeqin object. This data structure hold the initial definition of an input sequence, and the data used during processing. The start for sequence input is the USA (Uniform Sequence Address) which is simply a string containing the specification of the input sequence. USAs can refer to a filename or to a database (defined through emboss.default). They can include a query by ID or accession (or, in principle, description but this is not yet implented). They can include the input sequence format for a file (for a database the format is part of the database definition). ++USAs are to be extended to query by sequence version, at the request of EBI. USAs have been extended to include the sequence range and (for DNA sequences) reverse orientation. These extensions are not yet in general use by the user community. Sequence input can be of 3 types, all using the same USA mechanism. Sequence: A single sequence Seqall: One sequence at a time until all input is exhausted Seqset: All sequences loaded together, like Seqall but with everything saved. Intended for tasks such as multiple sequence alignment input. Access Methods -------------- Sequence reading uses a variety of access methods, defined by the USA or by teh database definition. These are implemented in a set of functions defined in the seqAccess structure in ajseqdb.c The main method is seqAccessFile (reads a simple buffered file). For details on each one, see the appropriate function. Overview -------- Sequence input starts in the ACD processing, where the USA is given as input. See function acdSetSeq for the basic features. acdSetSeqall and acdSetSeqset are variants for the other input types. The user-specified USA is stored as the USA (ajSeqinUsa). Optional qualifiers can be provided by the user, and are stored as follows: sformat: FormatStr (format can also be part of the USA) sdbname: Db (only if needed specially for output, dbname is taken from the USA) sopenfile: Filename Rarely used. Should be simply the USA. sid: Entryname, used if the seuqnece input has no name to save inventing one. supper: Upper slower: Lower snucleotide: forces type to be nucleotide sprotein: forces type to be protein sbegin: Begin but the USA can provide this, and the user can be prompted send: End but the USA can provide this, and the user can be prompted sreverse: Rev but the USA can provide this, and the user can be prompted sask: Makes ACD processing prompt the user for begin, end (and rev) ufo: Ufo feature input specification if features are to be loaded fformat: FtQuery->Formatstr feature format if not in the Ufo or database fopenfile: Ftquery->Filename feature file, for example separate GFF file When all input is set, acdSetSeq calls ajSeqRead. This happens before any prompts for begin, end or reverse as these are applied to the sequence after it has been read. Function calls for sequence input --------------------------------- The function calls are extensively traced by ajDebug calls, so runnign any application with -debug will trace the steps of sequence input. seqret is usually the method of choice (or seqretset for sequence set input), but as this uses ajSeqallRead an application that reads a single sequence may be preferred for some testing. With 'make check' there is an application seqretsingle for this, but any application with sequence as the input type will be suitable (as will acdc with a test .acd file) ajSeqAllRead ------------ Runs as ajSeqRead to read the first sequence. The sequence is stored in an AjPSeqall object which is used to read further sequences in function ajSeqallNext. Calls seqRead to read the next sequence. Db and Entryname are set if values are set in AjPSeqin. ajSeqsetRead ------------ Writes an AjPSeqset object by reading all sequence input until exhausted and storing the sequences in an AjPList, converted at the end into an array. The code is simple. seqUsaProcess analyses the USA. A loop calling ajSeqRead reads each sequence. After reading, each sequence can have its type, dbname set from the AjPSeqin saved values. ajSeqRead --------- This is the key function in sequence reading. This function takes an AjPSeqin object as input, processes the USA and any other defined elements (format for example), generates internal data structures for database queries, opens some file for input, and reads the first sequence. ajSeqRead maintains the master list of known sequence input formats. If no format is specified, sequence raeding can iterate through this list until one is successful. This is stored in a static variable seqInFormat, which starts as a pointer to the static data structure, but can be reallocated and filled with a copy. There was be a good reason for cloning seqInFormat. It fixes any specified default format (EMBOSS_FORMAT variable) using seqSetInFormat. This requires write access to the data structure, and happens at the start of ajSeqRead. ++ it appears this cloning should be a separate function, also called from ajSeqAllRead or in seqRead, if it is useful to both. ajSeqRead is designed to continue processing after the first sequence has been read: (a) If there is an open file (seqin->Filebuff->File) reading continues. (b) If there is a list of USAs (from an @list type USA) the next USA in the list is processed, ready for the next step. With an open file, a newly processed USA, or simply an unprocesses AjPSeqin, processing is passed to seqRead. The result, assuming success, has its type defined (and, if in AjPSeqin, the dbname and entryname) and is returned. If seqRead fails, the next item in an @list of USAs can be processed. Without an @list there was only one USA to process and we will simply return failure. A while loop takes care of this step. seqRead ------- Called by ajSeqRead (and by ajSeqallRead). There are some tricky concepts in sequence reading to be covered here. See "Access Methods" above. Access functions are all in ajseqdb.c even if they read files rather than databases. Some access methods read one sequence at a time (seqin->Single is set). After the first sequence, seqRead calls the Access function again with the same AjPSeqin, for example to process the next sequence from a list of query results. The access function will leave AjPSeqin with the correct file open, at the correct offset, and with a format (if specified for a database) defined. If a format is specified, seqRead uses only that format, and returns sucecss (ajTrue)( if it works, but clears the AjPSeq and returns failure (ajFalse) if it fails. If no format is defined (usually only on the first call), seqRead will iterate through the internal format structure seqInFormatDef (cloned in ajSeqRead). Formats with Try set are used. The others are dangerous in that they could read nonsense input as valid, and are only used if specified by the user. Sequence reading uses seqReadFmt, which returns 4 status codes: FMT_OK: success, simply return ajTrue FMT_BADTYPE: Read something, but sequence type failed to match specification, for example from database definition. Return ajFalse. FMT_FAIL: failed to read. Probably a bad format. FMT_NOMATCH: Read something, but a query specification (usually id) failed to match. Can be caused by searching a file until the right entry is found. Has set the format, so after the switch block we break out of the format test but try again below. With no format, if the first attempt failed, a check is made on seqin->Search. If we are looking for a query match, we can try reading further with the same format until we find a match or exhaust the input. This simply uses a while loop with a call to seqReadFmt. On success, this simply returns ajTrue. If we are still in seqRead we have failed to find anything. A closing ajDebug call explains whether a format was defined, or if the search for a format failed. seqReadFmt ---------- ++ I was afraid someone would ask about this one. Here are the mysteries. seqReadFmt uses the definitions already set to attempt to read a sequence. The function arguments include the sequence format, as an index number into the format definitions, which are also passed in as they can be the default set or the set rewritten in ajSeqRead. Most of this function is very simple ... a big if block of what to do if a sequence is read by the format-specific input function (seqReadXxxxxx) . On failure, the input buffer is reset and FMT_FAIL is returned. ++ perhaps this could be reversed to have a small if block of processing failure. WOuld be easier to understand. The function starts by calling the input function for the format (see seqInFormatDef which applies even after rewriting of the data structure) For example, "fasta" format will be a call to seqReadFasta. After a sequence has been read, further processing is required (at least in some cases) to check whether it is the 'right' sequence. This is a call to seqQueryMatch which compares the sequence to the stored query data, for example an ID (dbname-id:id or filename:id as the USA) ++ again, reversing the test to put the failure code in an easier place would help. This clears the AjPSeq and returns FMT_NOMATCH If this passes, the AjPSeqin is checked for a feature specification. If features are required, they are read using ajFeatUfoRead which is a whole new topic in itself. Features are often not available, so this is expected to fail silently (for now) if no features are found. A final check is made for the sequnece type (a call to seqTypeCheck), and if all is well the sequence begin, and and rev settings (if any) are copied from AjPSeqin to the new AjPSeq. If the type fails, this returns FMT_BADTYPE. ++ This check should be before format reading