Readseq Extended Help

History
General reformatting options
Graphic User Interface options
Command Line Interface options
Common Gateway Interface options
Feature handling
Biosequence Formats
Known Formats
Feature Table-only Formats
Other Formats
Efficiency and Limitations
Programming with Readseq
Source code
Availability
How to compile with Metrowerks Codewarrior
How to compile with Sun javac
Release notes
Et cetera

History

Readseq was written originally around 1989 a component of a sequence analysis program, in Pascal, but when I added a small, simple command-line interface, it took on a life of its own as a conversion program for bioinformatics. It's main contribution to bioinformatics is it takes on the job of guessing what your input biosequence data format is, and converting it to what your software knows how to handle.

It was converted to a C program in early 1990's and after an update in 1993, remained as is for several years, as I wrote around it and moved thru C++ then on to Java as a primary bioinformatics language. During this time, I'd wanted often to teach readseq to handle sequence documentation; the original ignored all but a few fields of information other than sequence data. In late 1990's my sequence analysis program SeqPup was in its Java incarnation and needed this, especially handling of feature annotations, locating genes, introns and such in a sequence.

After slowly updating readseq from a haphazard C program to an object oriented structure in Java, I pulled it back out of its parent source to become again a stand-alone program for format conversion. This release version 2, first available in 1999, continues support for the "classic" C version, in that it supports the same command-line options, but has extensions for sequence documentation, feature table and other additions, plus new sequence format conversions, and a lot of bug fixing. This java version is also more efficient, working faster than the compiled C classic version. It still isn't efficient enough to handle large sequences (genome sized or full GenBank/EMBL data release files).

In its current Java incarnation, interfacing Readseq with other languages is done mainly through command-line calls to the main program. If your programs are in Perl, you may want to use the bioperl.org collection with its SeqIO package.

PUBLIC DOMAIN NOTICE:
This software is freely available to the public for use. The author, Don Gilbert, of Readseq and the Java package 'iubio.readseq' does not place any restriction on its use or reproduction. Developers are encourged to incorporate parts in their programs. I would appreciate being cited in any work or product based on this material. This software is provided without warranty of any kind.

-- Don Gilbert
software@bio.indiana.edu, May 2001
Bioinformatics group, Biology Department & Cntr. Genomics & Bioinformatics,
Indiana University, Bloomington, Indiana

General reformatting options

Graphic User Interface options

This version includes a graphic user interface for those who prefer not to learn the many command line options, or who's workstation lacks a command-line interface (which probably includes most but the hardy bioinformaticians :).

The call format 'java -cp readseq.jar app' starts this simple graphic user interface, based on Java Swing. For MacOS, see the ReadseqApp driver program which does the equivalent to this commandline call. The menu options in this 'app' interface cover the equivalent of the command line options (see below).

The steps to use this are

open sequence files or URLs,
select output file,
select an output format from the Output format popup menu,
select any output options such as feature restrictions,
process files

Progress of the processing is listed in the Message window.

File menu:

About Readseq...: - display some help documentation.
Open sequence files...: - open sequence files from selector dialog;
Open URLs to sequence...: - open sequence data from Internet URL (http://, and probably ftp:// work)
Save as...: - Select output file for converted sequences
Clear list of open sequences: - This removes any opened files from list to process. Files are not converted until 'Process' is chosen, but you can add any number of files to process with the 'Open' command before hand.
Process selected files: - Convert 'Open' sequences to selected format, saving to 'Save' file

Options menu:

List sequences only
Verbose progress report
Calculate checksum of sequences
Change seq to lower case
Change seq to UPPER CASE
No case change
Remove gap symbols
Reverse-complement sequence
Translate input symbol(s) to output symbol(s)
Set range of sequence,features to extract: Extract data from a given base range, e.g. 500..10000
No feature table selection
Extract sequence of selected features...: Extract sequence based on features, only where input data includes feature tables.
Remove sequence of selected features...: Remove sequence based on features, only where input data includes feature tables.
Set subrange of selected features: Subrange location of extracted features (e.g., -1000..10 for upstream bases; end..100 for downstream)
XML - include document type definition

Pretty print menu:
These options set parameters for the Pretty-print format, such as numbering and labels.

Command Line Interface options

  Read & reformat biosequences, Java command-line version
  Usage: java -cp readseq.jar run [options] input-file(s)
  For more details: java -cp readseq.jar help more

  Options
    -a[ll]              select All sequences
      'all' will cause processing of all sequences (default now for version 2, for
      compatibility with version 1).  Use items=1,2,3 to select a subset.
      
    -c[aselower]        change to lower case
    -C[ASEUPPER]        change to UPPER CASE
      'caselower' and 'CASEUPPER' will convert sequence case
      
    -degap[=-]          remove gap symbols
      'degap=symbol' will remove this symbol from output sequence (- normally)
      
    -f[ormat=]#         Format number for output,  or
    -f[ormat=]Name      Format name for output
          see Formats   list below for names and numbers
       'format=genbank', 'format=gb', 'format=xml', et cetera to select an output
       format.  You can also use format number (below), but these numbers may change
       with revisions.  Alternate names of formats are listed below (e.g.,
       'Pearson|FastA|fa' allows 'pearson', 'fasta', or 'fa' as name.  This is
       case-insensitive.
          
    -inform[at]=#       input format number,  or
    -inform[at]=Name    input format name.  Assume input data is this format
       'inform=genbank' lets you specify data input format.  Normally Readseq
       guesses the input format (usually correctly). Use this option if you wish
       to bypass this input format guessing.
       
    -i[tem=2,3,4]       select Item number(s) from several
      'items=2,3,4'  will select these sequence records from a multisequence 
      input file.
      
    -l[ist]             List sequences only
      'list' will list titles of sequence records
      
    -o[utput=]out.seq   redirect Output
       'output=file', send output to named file.
       
    -p[ipe]             Pipe (command line, < stdin, > stdout)
      'pipe' will cause input data to come from STDIN and output go to
      STDOUT unix standard files (unless -out is given and input file given),
      and no prompting or progress reports will occurr.
      
      
    -r[everse]          reverse-complement of input sequence
      'reverse' will write the sequence from end to start, and DNA bases are
      complemented.  Amino residues are not complemented.
      
    -t[ranslate=]io     translate input symbol [i] to output symbol [o]
                        use several -tio to translate several symbols
      translates given sequence bases, e.g. -tAN to change 'A' to 'N'
                        
    -v[erbose]          Verbose progress
      'verbose' will print some progress reports
      
    -ch[ecksum]         calculate & print checksum of sequences


   Documentation and Feature Table extraction:
   
    -feat[ures]=exon,CDS...   extract sequence of selected features
    -nofeat[ures]=repeat_region,intron... remove sequence of selected features 
      'feature=CDS,intron' lets you specify those features to extract, or remove, 
      in the output.  Currently this causes each feature to produce a new
      sequence record.  Use the '
    
    -field=AC,ID...      include selected document fields in output
    -nofield=COMMENT,... remove selected document fields from output 
    
See below for subrange option
    -subrange=-1000..10  * extract subrange of sequence for feature locations
    -subrange=1..end      
    -subrange=end-10..end+99  
    
    -extract=10000..99999   extract all features and sequence from given base range

    
See below for pair, unpair options
    -pair=1              * combine features (fff,gff) and sequence files to one output
    -unpair=1            * split features,sequence from one input to two files


   Pretty format options:
    -wid[th]=#            sequence line width
    -tab=#                left indent
    -col[space]=#         column space within sequence line on output
    -gap[count]           count gap chars in sequence numbers
    -nameleft, -nameright[=#]   name on left/right side [=max width]
    -nametop              name at top/bottom
    -numleft, -numright   seq index on left/right side
    -numtop, -numbot      index on top/bottom
    -match[=.]            use match base for 2..n species
    -inter[line=#]        blank line(s) between sequence blocks


This program requires a Java runtime (java or jre) program, version 1.1.x, 1.2 or later
The leading '-' on option is optional if '=' is present.  All non-options
(no leading '-' or embedded '=') are used as input file names.
These options and call format are compatible with the classic readseq (v.1992)
* New experimental feature handling options, may not yet work as desired.

To test readseq with a built-in data set, use: java -cp readseq.jar test

Common Gateway Interface options

This version includes a Common gateway interface (CGI) for use in web servers. The call format 'java -cp readseq.jar cgi' starts this. To use in a web server on a unix system, install a shell script like this as '/cgi-bin/readseq.cgi':

  #!/bin/sh
  envtemp=/tmp/rseq$$.env
  env > ${envtemp}
  /usr/java/bin/java -cp readseq.jar cgi env=${envtemp}
  /bin/rm ${envtemp}

The options in this 'cgi' interface include all of the command line options (see below). The 'env & ${envtemp}' is used to pass CGI environment variables to readseq. Included in the readseq.jar rez/ folder is the CGI form presented as 'cgiform.html'. If you wish to customize this, extract it from readseq.jar, edit, and put into the classpath ahead of readseq.jar, as

  -- where this directory contains reaseq.jar and readseq.cgi --
  jar -xf readseq.jar rez/cgiform.html
  -- edit rez/cgiform.html --
  /usr/java/bin/java -cp .:readseq.jar cgi env=${envtemp}

The options for using readseq cgi are same as for readseq command line, with a few additions (see iubio.readseq.cgi java source).

Feature handling

Version 2 added document and feature table parsing, which has become an essential need in sequence manipulation. Release 2.1 further enhances and debugs this aspect.

One can currently extract sequence of a given set of features from a biosequence with feature tables.

  Option: feature=gene,mRNA,repeat (feat for short)
  
  -- produce embl file gene features only
  java -cp readseq run format=embl out=flygenes.em feat=gene,mRNA  fly-X.gb

Updates include extraction of proper strand orientation. E.g. extracting a feature with location 'complement(join(1..10,20..30))' produces the reverse-complement of bases 1..10 and 20..30
In development are methods to extract subranges of features, and produce output records for each or selected features, to let one, for instance, extract sequence in a range upstream/downstream from gene features.
Feature subrange extractions are accomplished with

  Option:  subrange=#start..#stop
   
    -subrange=-1000..10    extract subrange of sequence for feature locations
    -subrange=1..end      
    -subrange=end-10..end+99  
          -1000...10 is 1000 bases upstream/before to 10 in feature,
             1..end is full feature range (default for no -subrange option)
             end-10..end+99 is 10 before end to 99 after end of feature
             only valid with -feat/-nofeat

Sequence Range Math

Readseq handles subranges as the intersection with a given feature location. Subrange math respects 'complement()' orientation, as well as the 'end' value to specify end of a given location. Subranges can be a join() of several ranges. The subrange is intersected with a given feature range, and extended beyond its ends as specified. Here are some subrange computation examples:

subrange(-100..10) -- 100 bases upstream of feature to 10 bases inside.
subrange(1..end1)  -- the full range of the feature.
subrange(end1..end100) -- from the end to 100 bases downstream of the feature.
subrange(-100..end100) -- 100 bases upstream to 100 bases downstream.
subrange(join(-100..10,end1..end100)) -- 100 bases upstream to 10 into, 
                       joined with the end to 100 bases downstream.

Test SeqRange.subrange()
10..60 & subrange(-100..10)  = -91..19
10..60 & subrange(1..end1)  = 10..60
10..60 & subrange(end1..end100)  = 60..159
10..60 & subrange(-100..end100)  = -91..159
10..60 & subrange(join(-100..10,end1..end100))  = join(-91..19,60..159)

complement(10..60) & subrange(-100..10)  = complement(51..161)
complement(10..60) & subrange(1..end1)  = complement(10..60)
complement(10..60) & subrange(end1..end100)  = complement(-89..10)
complement(10..60) & subrange(-100..end100)  = complement(-89..161)
complement(10..60) & subrange(join(-100..10,end1..end100))  = complement(-89..10,51..161)

join(1..5,20..40,55..60) & subrange(-100..10)  = -100..5
join(1..5,20..40,55..60) & subrange(1..end1)  = join(1..5,20..40,55..60)
join(1..5,20..40,55..60) & subrange(end1..end100)  = 60..159
join(1..5,20..40,55..60) & subrange(-100..end100)  = join(-100..5,20..40,55..159)
join(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100))  = join(-100..5,60..159)

complement(1..5,20..40,55..60) & subrange(-100..10)  = complement(55..161)
complement(1..5,20..40,55..60) & subrange(1..end1)  = complement(1..5,20..40,55..60)
complement(1..5,20..40,55..60) & subrange(end1..end100)  = complement(-98..1)
complement(1..5,20..40,55..60) & subrange(-100..end100)  = complement(-98..5,20..40,55..161)
complement(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100))  = complement(-98..1,55..161)

join(1000..1500,2000..2500,3000..3100) & subrange(-100..10)  = 899..1009
join(1000..1500,2000..2500,3000..3100) & subrange(1..end1)  = join(1000..1500,2000..2500,3000..3100)
join(1000..1500,2000..2500,3000..3100) & subrange(end1..end100)  = 3100..3199
join(1000..1500,2000..2500,3000..3100) & subrange(-100..end100)  = join(899..1500,2000..2500,3000..3199)
join(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100))  = join(899..1009,3100..3199)

complement(1000..1500,2000..2500,3000..3100) & subrange(-100..10)  = complement(3091..3201)
complement(1000..1500,2000..2500,3000..3100) & subrange(1..end1)  = complement(1000..1500,2000..2500,3000..3100)
complement(1000..1500,2000..2500,3000..3100) & subrange(end1..end100)  = complement(901..1000)
complement(1000..1500,2000..2500,3000..3100) & subrange(-100..end100)  = complement(901..1500,2000..2500,3000..3201)
complement(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100))  = complement(901..1000,3091..3201)

Biosequence Formats

Known Formats

ID	Name	Read	Write	Int'leaf	Features	Sequence	Suffix	Content-type
1	IG\|Stanford	yes	yes	--	--	yes	.ig	biosequence/ig
2	GenBank\|gb	yes	yes	--	yes	yes	.gb	biosequence/genbank
3	NBRF	yes	yes	--	--	yes	.nbrf	biosequence/nbrf
4	EMBL\|em	yes	yes	--	yes	yes	.embl	biosequence/embl
5	GCG	yes	yes	--	--	yes	.gcg	biosequence/gcg
6	DNAStrider	yes	yes	--	--	yes	.strider	biosequence/strider
7	Fitch	--	--	--	--	yes	.fitch	biosequence/fitch
8	Pearson\|Fasta\|fa	yes	yes	--	--	yes	.fasta	biosequence/fasta
9	Zuker	--	--	--	--	yes	.zuker	biosequence/zuker
10	Olsen	--	--	yes	--	yes	.olsen	biosequence/olsen
11	Phylip3.2	yes	yes	yes	--	yes	.phylip2	biosequence/phylip2
12	Phylip\|Phylip4	yes	yes	yes	--	yes	.phylip	biosequence/phylip
13	Plain\|Raw	yes	yes	--	--	yes	.seq	biosequence/plain
14	PIR\|CODATA	yes	yes	--	--	yes	.pir	biosequence/codata
15	MSF	yes	yes	yes	--	yes	.msf	biosequence/msf
16	ASN.1	--	--	--	--	yes	.asn	biosequence/asn1
17	PAUP\|NEXUS	yes	yes	yes	--	yes	.nexus	biosequence/nexus
18	Pretty	--	yes	yes	--	yes	.pretty	biosequence/pretty
19	XML	yes	yes	--	yes	yes	.xml	biosequence/xml
20	BLAST	yes	--	yes	--	yes	.blast	biosequence/blast
21	SCF	yes	--	--	--	yes	.scf	biosequence/scf
22	Clustal	yes	yes	yes	--	yes	.aln	biosequence/clustal
23	FlatFeat\|FFF	yes	yes	--	yes	--	.fff	biosequence/fff
24	GFF	yes	yes	--	yes	--	.gff	biosequence/gff
25	ACEDB	yes	yes	--	--	yes	.ace	biosequence/acedb

ID is a number that can be used for this format (name is prefered). Alternate names are separated by '|'. You can use any of these to specify a format. Read and Write specify whether Readseq can read and write this format. Int'leaf means the format is interleaved. Features says if sequence record documentation and features are parsed. Sequence indicates if the format contains sequence data. Content-type is the magic string sent for that format thru a CGI web server. The suffix is the standard file suffix used for that format.

Feature Table-only Formats

These are increasingly useful when working with genome annotations, as the features are stored separately from sequence data (which is typically in fasta or raw format). The most widely used feature table syntax is the DDBJ/GenBank/EMBL Feature Table format, used in their respective flat file data records. The Gene-Finding/General Feature Format (GFF, from Sanger Centre) is also increasingly common for genome annotation work. Various flavors of XML are also used for feature annotations of sequence data.

Readseq v 2.1 can merge feature-only and sequence-only files to produce feature+sequence extracted outputs.

  Option: pair-feature-seq (pair for short) 
  Option: unpair-feature-seq (unpair for short) 

  -- produce embl file from feature,sequence pair
  java -cp readseq run pair=1 format=embl out=flypair.em features-Xtop.tsv dna-Xtop.raw

  -- produce embl file of subrange from gene features using feature,sequence pair
  java -cp readseq run pair=1 format=embl out=flypairex.em feat=gene,mRNA subrange=-1000..10 features-Xtop.tsv dna-Xtop.raw

  -- produce fff,fasta file pair from embl sequence
  java -cp readseq run unpair=1 format=fff out=flyunpair.fff flypair.em

The pair option causes input sequence files to be checked and read as a pair of feature,sequence files. The unpair option will split an input feature+sequence file into paired feature, sequence files.

FFF - Flattened Feature Format (FlatFeat)
This is essentially the DDBJ/GenBank/EMBL Feature Table specification with newlines and extra space removed, to produce single-line feature entries which can more efficiently be indexed and read by software. Basic FFF format is a tab-separated-value file, with these columns: Key Location Qualifiers
```
# flatfeat-version 2
Key Location  Qualifiers
source  1..1684   /organism="Drosophila melanogaster" ;/chromosome="3L" ; /map="69C" ; /ACCESSION="U57488" ...
gene  18..1684  /gene="est-6"
CDS join(18..1404,1456..>1684)  /gene="est-6" ;...
exon  18..1404  /gene="est-6"
intron  1405..1455  /gene="est-6"
exon  1456..>1684 /gene="est-6"
     
```
Similar to GFF, FFF works better for me as it contains all information about a feature in one row. GFF splits gene, mRNA, other multi-location features among several rows, and currently lacks a standard syntax for grouping these feature parts.

GFF - General Feature Format
originally Gene Finding Format, find this specification at http://www.sanger.ac.uk/Software/formats/GFF/. When generating GFF format, Readseq uses 'group FeatureName_ItemNumber ; ' as the first attribute to indicate a feature that is split among lines.

##gff-version 2
# seqname source  feature start end score strand  frame attributes
DMU57488  - source  1 1684  . + . organism "Drosophila melanogaster" ; chromosome "3L" ...
DMU57488  - gene  18  1684  . + . gene "est-6"
DMU57488  - CDS 18  1404  . + . group CDS_1 ; gene "est-6" ; ...
DMU57488  - CDS 1456  1684  . + . group CDS_1
DMU57488  - exon  18  1404  . + . gene "est-6"
DMU57488  - intron  1405  1455  . + . gene "est-6"
DMU57488  - exon  1456  1684  . + . gene "est-6"

Other Formats

XML - a simple conversion of GenBank/EMBL document standard elements to XML tags. Later releases of Readseq may add support for other XML versions (NCBI MinSeq and TinySeq dtds, others?).
XML may be akin to a spreadsheet format for bioinformatics data; it is flexible enough to add new fields and content, in contrast to most other biosequence formats, excepting ASN.1. However, the data structure and field specifics for each XML implementation differ widely; just formatting data with angle bracket tags doesn't solve the problem of making it readable by software.

I don't know yet of an XML format for biosequence data that in general use, though some being used for genome annotation may well make it into Readseq's parser soon. For instance, TIGR's XML for genome annotation makes good sense, but it uses a different data structure than Readseq-XML, so requires a different software unit to parse.
ASN.1 - NCBI's 'iron-man' biosequence format is complex, and currently the time to support it in Readseq is prohibitive. As Jim Ostell notes, ASN.1 leads XML in its functionality for biosequence and related bioinformatic data (who wants to see Genbank's size grow 10-fold just through a format conversion?). ASN.1 provides more precise syntax (what is an integer in XML?) for bioinformatic data. But alas, the software to parse ASN.1 is more complex and less readily available than for XML.

Efficiency and Limitations

Release 2 in Java of readseq has been completely revised to object-oriented and efficient data handling. It is faster than release 1 which was compile C code, by a factor of 2 to 4 times. This isn't because Java yet matches native compiled C code for speed (though it is approaching it), but because the original Readseq was not efficient.

Readseq 2 appears faster in limited testing than Bioperl's SeqIO Perl modules, by roughly a factor of 2 (using Java 1.3 and Perl 5.005 on a Sun Solaris system). Release 2 is substatially improved over release 1, and should provide more reliable results as well as the many improvements in documentation and feature handling. It retains command-line compatibility with release 1, so should be a drop-in replacement.

Readseq is currently not recommended for very large (100+MB) sequence files, whether as a single record or multiple records. Hopefully this limit will be lifted in coming months to enable use with genome-sized sequences and feature tables.

The main limitation is memory usage - it is not optimized for large data sets, but reads all of a sequence record in memory, generating numerous objects for documentation and a byte array for sequence.

Programming with Readseq

Readseq started out as a component of a sequence analysis program, but when I added a simple command-line interface, it took on a life of its own as a conversion program for bioinformatics.

Readseq started out as a component of a sequence analysis program, and continues as a component tool, which I use in various other programs. For use with other Java programs, add the readseq.jar to your library class path, then call Readseq public entries.

Here is one basic variant to convert A to B formats:

 
// compile as: javac -classpath readseq.jar:$CLASSPATH  testrsq.java
// run as:     java -cp .:readseq.jar testrsq  inputfile(s)...
import java.io.*;
import iubio.readseq.*;
class testrsq {
  public static void main(String[] args) {
    try {
    int outid= BioseqFormats.formatFromName("fasta");
    BioseqWriterIface seqwriter= BioseqFormats.newWriter(outid);
    seqwriter.setOutput( System.out);
    seqwriter.writeHeader();
    Readseq rd= new Readseq();
    for (int i=0; i<args.length; i++) {
      rd.setInputObject( args[i] );
      if (rd.isKnownFormat() && rd.readInit())
      rd.readTo( seqwriter);
      }
    seqwriter.writeTrailer();
    }
  catch (Exception e) { e.printStackTrace(); }
  }
}

To read a sequence file for use in your program:

 
import java.io.*;
import iubio.bioseq.*;
import iubio.readseq.*;
class testrsq2 {
  public static void main(String[] args) {
    try {
      // your data goes here
    Object inputObject= new FileReader("myseq.embl");
    Readseq rd= new Readseq();
    String seqname= rd.setInputObject( inputObject );
      //Readseq.setInputObject() accepts many basic Java objects including 
      //  Reader, File, URL, InputStream, String, char[], byte[], 
      //  and an Enumeration of these objects
    System.err.println("Reading from "+seqname);
  
    if ( rd.isKnownFormat() && rd.readInit() )  {
      while (rd.readNext()) {
        BioseqRecord seqrec= new BioseqRecord(rd.nextSeq());

        // do something with seqrec....
        FeatureItem[] fits= seqrec.findFeatures( new String[]{"CDS", "intron"});
        if (fits==null) System.out.println("  No such features found.");
        else {
          System.out.println("  Extracted features and their sequence");
          for (int k= 0; k<fits.length; k++) try {
            System.out.println( fits[k]);
            Bioseq bseq= seqrec.extractFeatureBases( fits[k]);
            System.out.println(bseq); System.out.println();
            }
          catch (SeqRangeException sre) { System.out.println(sre.getMessage()); }
          }
        }
       } 
    } catch (Exception e) { e.printStackTrace(); }
  }           
}

To write a sequence file given your program has sequence and sequence document objects:

 
import java.io.*;
import java.util.Date;
import iubio.bioseq.*;
import iubio.readseq.*;
class testrsq3 {
  public static void main(String[] args) {
    try {
      // your data goes here
     Bioseq seq = new Bioseq( 
      "ccggagtgaggagcaacatgaactacgtgggactgggacttatcattgtgctgagctgcc");
     BasicBioseqDoc seqdoc= new BasicBioseqDoc("DMU57488");
     seqdoc.addDocField( seqdoc.kAccession, "U57488");
     seqdoc.addComment("Just a test sequence");
     seqdoc.addDate( new Date("February 11, 1997"));
     seqdoc.addSequenceStats( seq);
     seqdoc.addFeature("gene", new SeqRange("18..1684"));
     seqdoc.addFeatureNote("symbol", "est-6");
     seqdoc.addFeature("intron", new SeqRange("1405..1455"));
     BioseqRecord  seqrec= new BioseqRecord( seq, seqdoc);
   
       // now write to a file
     BioseqWriterIface writer= BioseqFormats.newWriter(
       BioseqFormats.formatFromName("XML"));
     writer.setOutput( new FileWriter("myseq.xml")) ;
     writer.writeHeader();
     if (writer.setSeq( seqrec)) writer.writeSeqRecord();
     else System.err.println("Failed to write " +seqrec); // or throw exception
     writer.writeTrailer();
     writer.close();
    } catch (Exception e) { e.printStackTrace(); }
  }           
}

See in the source package for further example programs:

testrsq.java
features.java
mergedocseq.java

Many of the format-specific parameters are set through .properties files, found in the source rez/ folder. A new format handler can be added by implementing these interfaces:

   iubio.readseq.BioseqReaderIface
   iubio.readseq.BioseqWriterIface

and then subclassing the iubio.readseq.BioseqFormat class, and adding the new iubio.readseq.BioseqFormat subclass to the list in rez/BioseqFormats.properties.

Source code

Availability

Find readseq source in ftp:, http://iubio.bio.indiana.edu/soft/molbio/readseq/java/ as readseq-source.zip

How to compile with Metrowerks Codewarrior

Unzip/unjar readseq-source.zip
(unzip -a readseq-source.zip to convert newlines)
Package contents are currently:
- rez/ -- Resource, properties files
- rseq.mcp.xml -- Metrowerks Codewarrior project definition
- src/ -- Java source code
- ibm-xml4j-min.jar -- basic XML SAX parser
- orgxml.jar -- XML class definitions
Import rseq.mcp.xml as the Codewarrior project.

How to compile with Sun javac

Follow step 1 above.
Find in src/split4javac.pl - a Perl program to convert multi-public java sources to single-public-named-same sources as required by javac.

Run split4javac.pl as:

    perl src/split4javac.pl -i rseq.mcp.xml -a src/split4javac.addlist -o splits/ >& log.split

    -- This reads project sources from .xml and creates output .java in split4/ 
    It can also be run from MacPerl/BBEdit

Compile with javac on unix as

  find splits -name "*.java" -exec \
    javac -classpath ibm-xml4j-min.jar:orgxml.jar -sourcepath splits -d splits {} \;

  -- if there are javac errors related to missing source classes, add the missing
     class to the src/split4javac.addlist and rerun split4javac.pl
     
  -- test as
   java -cp .:splits:ibm-xml4j-min.jar:orgxml.jar run

Release notes

Version 2.1.3 (07 June 2001) updates: 
  Added  -reverse option for reverse-complement of sequence
  Feature extraction of complement() locations now does reverse-complement
  Added feature subrange extractions:  
    -subrange=-1000..10    extract subrange of sequence for feature locations
    -subrange=1..end      
    -subrange=end-10..end+99  
            -1000...10 is 1000 bases upstream/before to 10 in feature,
             1..end is full feature range (default for no -subrange option)
             end-10..end+99 is 10 before end to 99 after end of feature
             only valid with -feat/-nofeat
    -extract=10000..99999   extract all features, sequence from given base range

  Added pair/unpair option for combining feature files and sequence files
      pair=1 myseq.gff myseq.fasta format=genbank 
          == means combine features + sequence in one output file
      unpair=1 myseq.genbank format=fff 
          == means output paired features, sequence files from one input
  Added  compare=1 option to test differences in sequence files

  Added ClustalW alignment, AceDB sequence formats
  Added FFF/Flat-Feature-Format (one-line DDBJ/EMBL/GenBank Feature Table)
  Added GFF/General-Feature-Format  
  Fixed EMBL to handle amino - SwissProt sequence format, GenBank to handle amino - GenPept
  A Perl script to convert readseq source to javac compatible form is included.
  Dropped obsolete/unsupported ZukerSeqFormat, OlsenSeqFormat, and reorganized format order (numbering)
  Various bug fixes; Java 1.2/3 compatibility

To do:
 -- Need to handle files in +300 MB range for genomes  
 -- add HTML output (feature hyperlinks? colorized features? )
 -- test -pair and -unpair seq ( .fff + .seq, .gff + .seq) merge options

Et cetera

Find other Java sources for bioinformatics applications at IUBio Archive. These include SeqPup, a biosequence editor and application framework, and gnomap, a genome map display program, both of which rely on Readseq for sequence formatting.