Readseq Extended Help


Readseq was written originally around 1989 a component of a sequence analysis program, in Pascal, but when I added a small, simple command-line interface, it took on a life of its own as a conversion program for bioinformatics. It's main contribution to bioinformatics is it takes on the job of guessing what your input biosequence data format is, and converting it to what your software knows how to handle.

It was converted to a C program in early 1990's and after an update in 1993, remained as is for several years, as I wrote around it and moved thru C++ then on to Java as a primary bioinformatics language. During this time, I'd wanted often to teach readseq to handle sequence documentation; the original ignored all but a few fields of information other than sequence data. In late 1990's my sequence analysis program SeqPup was in its Java incarnation and needed this, especially handling of feature annotations, locating genes, introns and such in a sequence.

After slowly updating readseq from a haphazard C program to an object oriented structure in Java, I pulled it back out of its parent source to become again a stand-alone program for format conversion. This release version 2, first available in 1999, continues support for the "classic" C version, in that it supports the same command-line options, but has extensions for sequence documentation, feature table and other additions, plus new sequence format conversions, and a lot of bug fixing. This java version is also more efficient, working faster than the compiled C classic version. It still isn't efficient enough to handle large sequences (genome sized or full GenBank/EMBL data release files).

In its current Java incarnation, interfacing Readseq with other languages is done mainly through command-line calls to the main program. If your programs are in Perl, you may want to use the collection with its SeqIO package.

This software is freely available to the public for use. The author, Don Gilbert, of Readseq and the Java package 'iubio.readseq' does not place any restriction on its use or reproduction. Developers are encourged to incorporate parts in their programs. I would appreciate being cited in any work or product based on this material. This software is provided without warranty of any kind.

-- Don Gilbert, May 2001
Bioinformatics group, Biology Department & Cntr. Genomics & Bioinformatics,
Indiana University, Bloomington, Indiana

General reformatting options

Graphic User Interface options

This version includes a graphic user interface for those who prefer not to learn the many command line options, or who's workstation lacks a command-line interface (which probably includes most but the hardy bioinformaticians :).

The call format 'java -cp readseq.jar app' starts this simple graphic user interface, based on Java Swing. For MacOS, see the ReadseqApp driver program which does the equivalent to this commandline call. The menu options in this 'app' interface cover the equivalent of the command line options (see below).

The steps to use this are

  1. open sequence files or URLs,
  2. select output file,
  3. select an output format from the Output format popup menu,
  4. select any output options such as feature restrictions,
  5. process files
Progress of the processing is listed in the Message window.

File menu:

About Readseq...
- display some help documentation.
Open sequence files...
- open sequence files from selector dialog;
Open URLs to sequence...
- open sequence data from Internet URL (http://, and probably ftp:// work)
Save as...
- Select output file for converted sequences
Clear list of open sequences
- This removes any opened files from list to process. Files are not converted until 'Process' is chosen, but you can add any number of files to process with the 'Open' command before hand.
Process selected files
- Convert 'Open' sequences to selected format, saving to 'Save' file
Options menu:
List sequences only

Verbose progress report

Calculate checksum of sequences

Change seq to lower case

Change seq to UPPER CASE

No case change

Remove gap symbols

Reverse-complement sequence

Translate input symbol(s) to output symbol(s)

Set range of sequence,features to extract
Extract data from a given base range, e.g. 500..10000
No feature table selection

Extract sequence of selected features...
Extract sequence based on features, only where input data includes feature tables.
Remove sequence of selected features...
Remove sequence based on features, only where input data includes feature tables.
Set subrange of selected features
Subrange location of extracted features (e.g., -1000..10 for upstream bases; end..100 for downstream)
XML - include document type definition

Pretty print menu:
These options set parameters for the Pretty-print format, such as numbering and labels.

Command Line Interface options

  Read & reformat biosequences, Java command-line version
Usage: java -cp readseq.jar run [options] input-file(s)
For more details: java -cp readseq.jar help more

-a[ll] select All sequences
'all' will cause processing of all sequences (default now for version 2, for
compatibility with version 1). Use items=1,2,3 to select a subset.

-c[aselower] change to lower case
'caselower' and 'CASEUPPER' will convert sequence case

-degap[=-] remove gap symbols
'degap=symbol' will remove this symbol from output sequence (- normally)

-f[ormat=]# Format number for output, or
-f[ormat=]Name Format name for output
see Formats list below for names and numbers
'format=genbank', 'format=gb', 'format=xml', et cetera to select an output
format. You can also use format number (below), but these numbers may change
with revisions. Alternate names of formats are listed below (e.g.,
'Pearson|FastA|fa' allows 'pearson', 'fasta', or 'fa' as name. This is

-inform[at]=# input format number, or
-inform[at]=Name input format name. Assume input data is this format
'inform=genbank' lets you specify data input format. Normally Readseq
guesses the input format (usually correctly). Use this option if you wish
to bypass this input format guessing.

-i[tem=2,3,4] select Item number(s) from several
'items=2,3,4' will select these sequence records from a multisequence
input file.

-l[ist] List sequences only
'list' will list titles of sequence records

-o[utput=]out.seq redirect Output
'output=file', send output to named file.

-p[ipe] Pipe (command line, < stdin, > stdout)
'pipe' will cause input data to come from STDIN and output go to
STDOUT unix standard files (unless -out is given and input file given),
and no prompting or progress reports will occurr.

-r[everse] reverse-complement of input sequence
'reverse' will write the sequence from end to start, and DNA bases are
complemented. Amino residues are not complemented.

-t[ranslate=]io translate input symbol [i] to output symbol [o]
use several -tio to translate several symbols
translates given sequence bases, e.g. -tAN to change 'A' to 'N'

-v[erbose] Verbose progress
'verbose' will print some progress reports

-ch[ecksum] calculate & print checksum of sequences

Documentation and Feature Table extraction:

-feat[ures]=exon,CDS... extract sequence of selected features
-nofeat[ures]=repeat_region,intron... remove sequence of selected features
'feature=CDS,intron' lets you specify those features to extract, or remove,
in the output. Currently this causes each feature to produce a new
sequence record. Use the '

-field=AC,ID... include selected document fields in output
-nofield=COMMENT,... remove selected document fields from output

See below for subrange option
-subrange=-1000..10 * extract subrange of sequence for feature locations

-extract=10000..99999 extract all features and sequence from given base range

See below for pair, unpair options
-pair=1 * combine features (fff,gff) and sequence files to one output
-unpair=1 * split features,sequence from one input to two files

Pretty format options:
-wid[th]=# sequence line width
-tab=# left indent
-col[space]=# column space within sequence line on output
-gap[count] count gap chars in sequence numbers
-nameleft, -nameright[=#] name on left/right side [=max width]
-nametop name at top/bottom
-numleft, -numright seq index on left/right side
-numtop, -numbot index on top/bottom
-match[=.] use match base for 2..n species
-inter[line=#] blank line(s) between sequence blocks

This program requires a Java runtime (java or jre) program, version 1.1.x, 1.2 or later
The leading '-' on option is optional if '=' is present. All non-options
(no leading '-' or embedded '=') are used as input file names.
These options and call format are compatible with the classic readseq (v.1992)
* New experimental feature handling options, may not yet work as desired.

To test readseq with a built-in data set, use: java -cp readseq.jar test

Common Gateway Interface options

This version includes a Common gateway interface (CGI) for use in web servers. The call format 'java -cp readseq.jar cgi' starts this. To use in a web server on a unix system, install a shell script like this as '/cgi-bin/readseq.cgi':
env > ${envtemp}
/usr/java/bin/java -cp readseq.jar cgi env=${envtemp}
/bin/rm ${envtemp}
The options in this 'cgi' interface include all of the command line options (see below). The 'env & ${envtemp}' is used to pass CGI environment variables to readseq. Included in the readseq.jar rez/ folder is the CGI form presented as 'cgiform.html'. If you wish to customize this, extract it from readseq.jar, edit, and put into the classpath ahead of readseq.jar, as
  -- where this directory contains reaseq.jar and readseq.cgi --
jar -xf readseq.jar rez/cgiform.html
-- edit rez/cgiform.html --
/usr/java/bin/java -cp .:readseq.jar cgi env=${envtemp}

The options for using readseq cgi are same as for readseq command line, with a few additions (see iubio.readseq.cgi java source).

Feature handling

Version 2 added document and feature table parsing, which has become an essential need in sequence manipulation. Release 2.1 further enhances and debugs this aspect.

One can currently extract sequence of a given set of features from a biosequence with feature tables.

  Option: feature=gene,mRNA,repeat (feat for short)

-- produce embl file gene features only
java -cp readseq run format=embl out=flygenes.em feat=gene,mRNA
Updates include extraction of proper strand orientation. E.g. extracting a feature with location 'complement(join(1..10,20..30))' produces the reverse-complement of bases 1..10 and 20..30
In development are methods to extract subranges of features, and produce output records for each or selected features, to let one, for instance, extract sequence in a range upstream/downstream from gene features.
Feature subrange extractions are accomplished with
  Option:  subrange=#start..#stop

-subrange=-1000..10 extract subrange of sequence for feature locations
-1000...10 is 1000 bases upstream/before to 10 in feature,
1..end is full feature range (default for no -subrange option)
end-10..end+99 is 10 before end to 99 after end of feature
only valid with -feat/-nofeat
Sequence Range Math

Readseq handles subranges as the intersection with a given feature location. Subrange math respects 'complement()' orientation, as well as the 'end' value to specify end of a given location. Subranges can be a join() of several ranges. The subrange is intersected with a given feature range, and extended beyond its ends as specified. Here are some subrange computation examples:

subrange(-100..10) -- 100 bases upstream of feature to 10 bases inside.
subrange(1..end1) -- the full range of the feature.
subrange(end1..end100) -- from the end to 100 bases downstream of the feature.
subrange(-100..end100) -- 100 bases upstream to 100 bases downstream.
subrange(join(-100..10,end1..end100)) -- 100 bases upstream to 10 into,
joined with the end to 100 bases downstream.

Test SeqRange.subrange()
10..60 & subrange(-100..10) = -91..19
10..60 & subrange(1..end1) = 10..60
10..60 & subrange(end1..end100) = 60..159
10..60 & subrange(-100..end100) = -91..159
10..60 & subrange(join(-100..10,end1..end100)) = join(-91..19,60..159)

complement(10..60) & subrange(-100..10) = complement(51..161)
complement(10..60) & subrange(1..end1) = complement(10..60)
complement(10..60) & subrange(end1..end100) = complement(-89..10)
complement(10..60) & subrange(-100..end100) = complement(-89..161)
complement(10..60) & subrange(join(-100..10,end1..end100)) = complement(-89..10,51..161)

join(1..5,20..40,55..60) & subrange(-100..10) = -100..5
join(1..5,20..40,55..60) & subrange(1..end1) = join(1..5,20..40,55..60)
join(1..5,20..40,55..60) & subrange(end1..end100) = 60..159
join(1..5,20..40,55..60) & subrange(-100..end100) = join(-100..5,20..40,55..159)
join(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100)) = join(-100..5,60..159)

complement(1..5,20..40,55..60) & subrange(-100..10) = complement(55..161)
complement(1..5,20..40,55..60) & subrange(1..end1) = complement(1..5,20..40,55..60)
complement(1..5,20..40,55..60) & subrange(end1..end100) = complement(-98..1)
complement(1..5,20..40,55..60) & subrange(-100..end100) = complement(-98..5,20..40,55..161)
complement(1..5,20..40,55..60) & subrange(join(-100..10,end1..end100)) = complement(-98..1,55..161)

join(1000..1500,2000..2500,3000..3100) & subrange(-100..10) = 899..1009
join(1000..1500,2000..2500,3000..3100) & subrange(1..end1) = join(1000..1500,2000..2500,3000..3100)
join(1000..1500,2000..2500,3000..3100) & subrange(end1..end100) = 3100..3199
join(1000..1500,2000..2500,3000..3100) & subrange(-100..end100) = join(899..1500,2000..2500,3000..3199)
join(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100)) = join(899..1009,3100..3199)

complement(1000..1500,2000..2500,3000..3100) & subrange(-100..10) = complement(3091..3201)
complement(1000..1500,2000..2500,3000..3100) & subrange(1..end1) = complement(1000..1500,2000..2500,3000..3100)
complement(1000..1500,2000..2500,3000..3100) & subrange(end1..end100) = complement(901..1000)
complement(1000..1500,2000..2500,3000..3100) & subrange(-100..end100) = complement(901..1500,2000..2500,3000..3201)
complement(1000..1500,2000..2500,3000..3100) & subrange(join(-100..10,end1..end100)) = complement(901..1000,3091..3201)

Biosequence Formats

Known Formats

ID Name Read Write Int'leaf Features Sequence Suffix Content-type
1 IG|Stanford yes yes -- -- yes .ig biosequence/ig
2 GenBank|gb yes yes -- yes yes .gb biosequence/genbank
3 NBRF yes yes -- -- yes .nbrf biosequence/nbrf
4 EMBL|em yes yes -- yes yes .embl biosequence/embl
5 GCG yes yes -- -- yes .gcg biosequence/gcg
6 DNAStrider yes yes -- -- yes .strider biosequence/strider
7 Fitch -- -- -- -- yes .fitch biosequence/fitch
8 Pearson|Fasta|fa yes yes -- -- yes .fasta biosequence/fasta
9 Zuker -- -- -- -- yes .zuker biosequence/zuker
10 Olsen -- -- yes -- yes .olsen biosequence/olsen
11 Phylip3.2 yes yes yes -- yes .phylip2 biosequence/phylip2
12 Phylip|Phylip4 yes yes yes -- yes .phylip biosequence/phylip
13 Plain|Raw yes yes -- -- yes .seq biosequence/plain
14 PIR|CODATA yes yes -- -- yes .pir biosequence/codata
15 MSF yes yes yes -- yes .msf biosequence/msf
16 ASN.1 -- -- -- -- yes .asn biosequence/asn1
17 PAUP|NEXUS yes yes yes -- yes .nexus biosequence/nexus
18 Pretty -- yes yes -- yes .pretty biosequence/pretty
19 XML yes yes -- yes yes .xml biosequence/xml
20 BLAST yes -- yes -- yes .blast biosequence/blast
21 SCF yes -- -- -- yes .scf biosequence/scf
22 Clustal yes yes yes -- yes .aln biosequence/clustal
23 FlatFeat|FFF yes yes -- yes -- .fff biosequence/fff
24 GFF yes yes -- yes -- .gff biosequence/gff
25 ACEDB yes yes -- -- yes .ace biosequence/acedb

ID is a number that can be used for this format (name is prefered). Alternate names are separated by '|'. You can use any of these to specify a format. Read and Write specify whether Readseq can read and write this format. Int'leaf means the format is interleaved. Features says if sequence record documentation and features are parsed. Sequence indicates if the format contains sequence data. Content-type is the magic string sent for that format thru a CGI web server. The suffix is the standard file suffix used for that format.

Feature Table-only Formats

These are increasingly useful when working with genome annotations, as the features are stored separately from sequence data (which is typically in fasta or raw format). The most widely used feature table syntax is the DDBJ/GenBank/EMBL Feature Table format, used in their respective flat file data records. The Gene-Finding/General Feature Format (GFF, from Sanger Centre) is also increasingly common for genome annotation work. Various flavors of XML are also used for feature annotations of sequence data.

Readseq v 2.1 can merge feature-only and sequence-only files to produce feature+sequence extracted outputs.

  Option: pair-feature-seq (pair for short) 
Option: unpair-feature-seq (unpair for short)

-- produce embl file from feature,sequence pair
java -cp readseq run pair=1 format=embl out=flypair.em features-Xtop.tsv dna-Xtop.raw

-- produce embl file of subrange from gene features using feature,sequence pair
java -cp readseq run pair=1 format=embl out=flypairex.em feat=gene,mRNA subrange=-1000..10 features-Xtop.tsv dna-Xtop.raw

-- produce fff,fasta file pair from embl sequence
java -cp readseq run unpair=1 format=fff out=flyunpair.fff flypair.em

The pair option causes input sequence files to be checked and read as a pair of feature,sequence files. The unpair option will split an input feature+sequence file into paired feature, sequence files.

Other Formats

Efficiency and Limitations

Release 2 in Java of readseq has been completely revised to object-oriented and efficient data handling. It is faster than release 1 which was compile C code, by a factor of 2 to 4 times. This isn't because Java yet matches native compiled C code for speed (though it is approaching it), but because the original Readseq was not efficient.

Readseq 2 appears faster in limited testing than Bioperl's SeqIO Perl modules, by roughly a factor of 2 (using Java 1.3 and Perl 5.005 on a Sun Solaris system). Release 2 is substatially improved over release 1, and should provide more reliable results as well as the many improvements in documentation and feature handling. It retains command-line compatibility with release 1, so should be a drop-in replacement.

Readseq is currently not recommended for very large (100+MB) sequence files, whether as a single record or multiple records. Hopefully this limit will be lifted in coming months to enable use with genome-sized sequences and feature tables.

The main limitation is memory usage - it is not optimized for large data sets, but reads all of a sequence record in memory, generating numerous objects for documentation and a byte array for sequence.

Programming with Readseq

Readseq started out as a component of a sequence analysis program, but when I added a simple command-line interface, it took on a life of its own as a conversion program for bioinformatics.

Readseq started out as a component of a sequence analysis program, and continues as a component tool, which I use in various other programs. For use with other Java programs, add the readseq.jar to your library class path, then call Readseq public entries.

Here is one basic variant to convert A to B formats:

// compile as: javac -classpath readseq.jar:$CLASSPATH
// run as: java -cp .:readseq.jar testrsq inputfile(s)...
import iubio.readseq.*;
class testrsq {
public static void main(String[] args) {
try {
int outid= BioseqFormats.formatFromName("fasta");
BioseqWriterIface seqwriter= BioseqFormats.newWriter(outid);
seqwriter.setOutput( System.out);
Readseq rd= new Readseq();
for (int i=0; i<args.length; i++) {
rd.setInputObject( args[i] );
if (rd.isKnownFormat() && rd.readInit())
rd.readTo( seqwriter);
catch (Exception e) { e.printStackTrace(); }

To read a sequence file for use in your program:

import iubio.bioseq.*;
import iubio.readseq.*;
class testrsq2 {
public static void main(String[] args) {
try {
// your data goes here
Object inputObject= new FileReader("myseq.embl");
Readseq rd= new Readseq();
String seqname= rd.setInputObject( inputObject );
//Readseq.setInputObject() accepts many basic Java objects including
// Reader, File, URL, InputStream, String, char[], byte[],
// and an Enumeration of these objects
System.err.println("Reading from "+seqname);

if ( rd.isKnownFormat() && rd.readInit() ) {
while (rd.readNext()) {
BioseqRecord seqrec= new BioseqRecord(rd.nextSeq());

// do something with seqrec....
FeatureItem[] fits= seqrec.findFeatures( new String[]{"CDS", "intron"});
if (fits==null) System.out.println(" No such features found.");
else {
System.out.println(" Extracted features and their sequence");
for (int k= 0; k<fits.length; k++) try {
System.out.println( fits[k]);
Bioseq bseq= seqrec.extractFeatureBases( fits[k]);
System.out.println(bseq); System.out.println();
catch (SeqRangeException sre) { System.out.println(sre.getMessage()); }
} catch (Exception e) { e.printStackTrace(); }

To write a sequence file given your program has sequence and sequence document objects:

import java.util.Date;
import iubio.bioseq.*;
import iubio.readseq.*;
class testrsq3 {
public static void main(String[] args) {
try {
// your data goes here
Bioseq seq = new Bioseq(
BasicBioseqDoc seqdoc= new BasicBioseqDoc("DMU57488");
seqdoc.addDocField( seqdoc.kAccession, "U57488");
seqdoc.addComment("Just a test sequence");
seqdoc.addDate( new Date("February 11, 1997"));
seqdoc.addSequenceStats( seq);
seqdoc.addFeature("gene", new SeqRange("18..1684"));
seqdoc.addFeatureNote("symbol", "est-6");
seqdoc.addFeature("intron", new SeqRange("1405..1455"));
BioseqRecord seqrec= new BioseqRecord( seq, seqdoc);

// now write to a file
BioseqWriterIface writer= BioseqFormats.newWriter(
writer.setOutput( new FileWriter("myseq.xml")) ;
if (writer.setSeq( seqrec)) writer.writeSeqRecord();
else System.err.println("Failed to write " +seqrec); // or throw exception
} catch (Exception e) { e.printStackTrace(); }

See in the source package for further example programs:

Many of the format-specific parameters are set through .properties files, found in the source rez/ folder. A new format handler can be added by implementing these interfaces:
and then subclassing the iubio.readseq.BioseqFormat class, and adding the new iubio.readseq.BioseqFormat subclass to the list in rez/

Source code


This software is freely available to the public for use. The author, Don Gilbert, of Readseq and the Java package 'iubio.readseq' does not place any restriction on its use or reproduction. Developers are encourged to incorporate parts in their programs. I would appreciate being cited in any work or product based on this material. This software is provided without warranty of any kind.

Find readseq source in ftp:, as

How to compile with Metrowerks Codewarrior

  1. Unzip/unjar
    (unzip -a to convert newlines)
  2. Package contents are currently:
  3. Import rseq.mcp.xml as the Codewarrior project.

How to compile with Sun javac

  1. Follow step 1 above.
  2. Find in src/ - a Perl program to convert multi-public java sources to single-public-named-same sources as required by javac.
  3. Run as:
        perl src/ -i rseq.mcp.xml -a src/split4javac.addlist -o splits/ >& log.split

    -- This reads project sources from .xml and creates output .java in split4/
    It can also be run from MacPerl/BBEdit
  4. Compile with javac on unix as
      find splits -name "*.java" -exec \
    javac -classpath ibm-xml4j-min.jar:orgxml.jar -sourcepath splits -d splits {} \;

    -- if there are javac errors related to missing source classes, add the missing
    class to the src/split4javac.addlist and rerun

    -- test as
    java -cp .:splits:ibm-xml4j-min.jar:orgxml.jar run

Release notes

Version 2.1.3 (07 June 2001) updates: 
Added -reverse option for reverse-complement of sequence
Feature extraction of complement() locations now does reverse-complement
Added feature subrange extractions:
-subrange=-1000..10 extract subrange of sequence for feature locations
-1000...10 is 1000 bases upstream/before to 10 in feature,
1..end is full feature range (default for no -subrange option)
end-10..end+99 is 10 before end to 99 after end of feature
only valid with -feat/-nofeat
-extract=10000..99999 extract all features, sequence from given base range

Added pair/unpair option for combining feature files and sequence files
pair=1 myseq.gff myseq.fasta format=genbank
== means combine features + sequence in one output file
unpair=1 myseq.genbank format=fff
== means output paired features, sequence files from one input
Added compare=1 option to test differences in sequence files

Added ClustalW alignment, AceDB sequence formats
Added FFF/Flat-Feature-Format (one-line DDBJ/EMBL/GenBank Feature Table)
Added GFF/General-Feature-Format
Fixed EMBL to handle amino - SwissProt sequence format, GenBank to handle amino - GenPept
A Perl script to convert readseq source to javac compatible form is included.
Dropped obsolete/unsupported ZukerSeqFormat, OlsenSeqFormat, and reorganized format order (numbering)
Various bug fixes; Java 1.2/3 compatibility

To do:
-- Need to handle files in +300 MB range for genomes
-- add HTML output (feature hyperlinks? colorized features? )
-- test -pair and -unpair seq ( .fff + .seq, .gff + .seq) merge options

Et cetera

Find other Java sources for bioinformatics applications at IUBio Archive. These include SeqPup, a biosequence editor and application framework, and gnomap, a genome map display program, both of which rely on Readseq for sequence formatting.