Authors: Jeremy D. Parsons, Eugen Buehler¹, LaDeana Hillier²

EMBL-Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; ¹Dept. of Computer and Information Science University of Pennsylvania, Philadelphia, PA 19104, USA, and ²Genome Sequencing Center, Washington University School of Medicine, Saint Louis, Missouri 63108, USA.

Corresponding Author:
Jeremy D. Parsons
EMBL-Outstation EBI,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge,
CB10 1SD
UK

Tel. +44 1223 494665
Fax. +44 1223 494468

jparsons@ebi.ac.uk

DNA Sequence Chromatogram Browsing

Using JAVA and CORBA

Abstract

DNA sequence chromatograms (traces) are the primary data source for all large-scale genomic and Expressed Sequence Tags (EST) sequencing projects. Access to the sequencing trace assists many later analyses, for example contig assembly and polymorphism detection, but obtaining and using traces is problematic. Traces are not centrally collected and published, they are much larger than the basecalls derived from them, and viewing them requires the interactivity of a local graphical client with local data. To provide efficient global access to DNA traces, we developed a client/server system based on flexible JAVA components integrated into other applications including an applet for use in a WWW browser and a stand-alone trace viewer. Client/server interaction is facilitated by CORBA middleware which provides a well defined interface, a naming service and location independence.

The software is packaged as a Jar file available from: URL: http://www.ebi.ac.uk/~jparsons . Links to working examples of the trace viewers can be found at http://corba.ebi.ac.uk/EST. All the Washington University mouse EST traces are available for browsing at the same URL.

Introduction

Biological Information Distribution

The Internet is host to an increasingly diverse range of mechanisms for biological data distribution. Two of the latest are the World-Wide Web (WWW) standards set by the W3C (URL:http://www.w3.org/), which are already well established amongst biologists, and the Object Management Group (OMG) Common Object Request Broker Architecture (CORBA), which is relatively new in this field (URL:http://www.omg.org). The existing WWW standards have the advantage of simplicity and broad availability; however, frequent extensions to HTML and additions such as JavaScript, XML and dynamic HTML (all described at http://www.hotwired.com/webmonkey/collections/crash_courses.html) have tested the ability of both WWW browser developers and users to keep up. The incorporation of JAVA applets (http://java.sun.com) into HTML documents has further tested the maintenance of common standards, as JAVA itself has undergone rapid change. However, JAVA's basic combination of security, portability, and desirability as a programming language has ensured the inclusion of a JAVA Virtual Machine (JVM) in the major WWW browsers, where it greatly increases the potential for client interactivity. In 1996, the ease with which client/server object-oriented applications could be written, distributed, and supported across the Internet increased when Netscape (http://www.netscape.com/) announced (Orfali and Harkey, 1997) that its browsers were all going to include a CORBA Object Request Broker (ORB), and when it subsequently decided to distribute its browsers for free.

The use of CORBA in a biological context was introduced by Hu et al. (1998) and Lijnzaad et al. (1998), who explain that CORBA can be a good solution to the problem of creating applications for distributed heterogeneous environments. The Internet is the extreme example of both distribution and heterogeneity and is described by Orfali and Harkey (1998) as being host to the "Object Web", where CORBA and JAVA complement each other's abilities to create globally accessible interactive objects. Principal among the benefits that CORBA brings to biological data distribution and interaction are the Interface Definition Language (IDL) to define interfaces between objects, scalability (including language and Operating System independence), state-preservation across invocations, and a rich set of 15 object services, e.g. the Naming Service.

Sequence Chromatograms

DNA sequence chromatograms are interpreted to produce nucleotide sequences (basecalling) and corresponding base call quality estimates. Whilst these derived views are more commonly used, the traces remain the ultimate reference source for any queries about that particular sequencing reaction. All commonly used sequence assembly packages (for example Bonfield et al., 1995) include proprietary trace browsers to help the user (finisher) distinguish poor quality data from good and so work backwards to recreate a representation of the original sequence. Furthermore, in regions with either trace artifacts specific to a particular sequencing chemistry or general background contamination, an experienced finisher might be able to diagnose the underlying problem correctly and provide a better basecall when provided with a suitable view of the original trace. As with contig assembly, trace availability can increase the success rate of STS development from ESTs by enabling an optimal estimation of the possible positions of basecalling errors. Mott (1998) explored more of the direct uses of sequence traces through trace alignment, including the identification of vector sequence (better than other automated methods) and the detection and analysis of polymorphisms/mutations. Examples of existing trace viewers and editors include Ted (Gleeson and Hillier, 1991), Consed (Gordon et al., 1998), and Trev (URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/trev_toc.html).

A poignant example of the special role of sequencing traces as the ultimate sequencing reaction reference has emerged from the combination of the recent release of the Phred basecalling program (Ewing and Green, 1998) and the status of the largest section of the public nucleotide databases: the EST sequences. ESTs are unusual in that they are submitted and published in a raw state with limited quality control (Hillier et al., 1996) and without the error detection and correction process intrinsic to normal shotgun assembly. Hillier et al. stressed the need for traces to be available on-line globally and have therefore maintained, since the beginning of their EST sequencing, an ftp site from which traces can be downloaded. Now that Ewing and Green have released Phred with its improved base calling (estimated to make 50% fewer errors than the original ABI basecaller), approximately 250,000 entries in the GenBank (Benson et al., 1998) and EMBL (Stoesser et al., 1998) nucleotide databases may be considered out of date and ripe for replacement, whilst the original chromatograms remain available on-line and ready for re-interpretation at the originating laboratory.

Overall, the Internet is enabling decentralization within all areas of biological data access via the simplicity and low cost of HTML, the code portability of JAVA, and now the global middleware of CORBA. Though DNA sequence traces are collectively large and scattered globally, they are still important and are following the same trends as other types of biological data: originally accessible via ftp, and now by JAVA applet over either HTTP or the OMG's Internet Inter-ORB Protocol (IIOP), as described in this paper.

Mini-Glossary

CORBA	Common Object Request Broker Architecture: A set of software standards and tools to act as middleware helping the creation and interaction of software components. Large programs being designed today may have so many dependencies and interconnections that they could become difficult to maintain. CORBA hides the complexity within each part of a program and simplifies the discovery and integration of groups of components needed to solve specific tasks. CORBA sits between components (hence the term middleware) and provides programming language independence, location independence and a set of commonly used services.
Dynamic-HTML	A heterogeneous collection of technologies to make HTML pages less static and able to change once downloaded into a client web browser (includes JavaScript and cascading style sheets)
JAVA	A modern, object-oriented, web-centric compiled programming language that is perhaps uniquely able to run on almost all computers from humble PCs up to mainframes. JAVA is a full, complex language that can be used to write any normal application on any computer, though it may be slower than some alternatives. JAVA was designed with the Internet in mind and has a sophisticated security model allowing users to rapidly download software (applets) and run it locally with very little risk. Programs running locally can be much more responsive than programs running across the Internet.
JavaScript	A simple interpreted computer language that can run in a WWW browser, where it can generate interactive HTML pages and control the content and checking of HTML objects like forms, applets and cookies. JavaScript was originally not directly related to JAVA, but the two share a similar syntax and complement each other's abilities. Furthermore, a typical JavaScript implementation will have limited access to the public internals of JAVA applet classes embedded in any downloaded HTML page.
Firewall	A firewall is commonly just a computer with two network cards running special software to monitor and control data going between a private intranet and the public Internet. The main purpose is usually to stop hackers damaging internal computers or gaining access to private data.
IIOP	Internet Inter-ORB Protocol is the standardized way that ORBs from different vendors can talk to each other and so pass messages between clients and servers etc.
IDL	Interface Definition Language is the specification language that is used to define the links between software components. A server will define in IDL all the objects and services it can offer whilst a client will use the IDL to learn both what a server can provide and how to ask for it.
ORB	An Object Request Broker is the piece of software that does all the linking between components when a program is actually running. One part of the program might be running in one location, on one kind of computer, and be written in one language whilst the other component might be different in every respect but the ORB can still let them interact without either being aware of any intermediary conversion and navigation process.
XML	Extensible Mark-up Language is a text based format for storing and sharing documents and data of any kind and so extends on HTML the language in which all existing WWW pages are written. HTML documents may contain text, graphics, JAVA applet classes or data for proprietary plug-ins but XML pages are completely extensible to any kind of content because XML acts as a meta language allowing the creation of specialised content languages for things not normally displayed in web pages such as scientific data, or database records. XML is a subset of the even more abstract SGML.

Results

A JAVA trace viewing applet originally written by Buehler (see Figure 1) has been developed into a set of trace viewing tools with each component filling a different software niche. The tools work with different versions of the JAVA Virtual Machine, are packaged as JAVA applets, applications, and JAVA Beans and operate as either CORBA client/server systems, or stand-alone applications.

Fig. 1

Screen shots of the original applet embedded in a web page (showing F17D4-Sp6), and the CORBA trace server client displaying a Washington University Genome Sequencing Center (WUGSC) EST trace (yx99h12.s1). Both clients offer scrolling, automatic scaling, and view selection of either the chromatogram, the called bases, or comments describing the conditions of the electrophoresis.

View of the Original Java1.1 Applet

Design

The design choices were influenced by many factors including, most importantly, the fact that the majority of DNA sequence traces are normally stored in individual files in one of only three formats: ABI, SCF V2, or SCF V3. Most sequencing machines' proprietary formats are convertible to the Standard Chromatogram File (SCF) formats (Dear and Staden, 1992), a process helped by the Staden group's provision of freely available SCF libraries (ftp://ftp.mrc-lmb.cam.ac.uk/pub/staden/src/). This lack of flat file diversity enabled both the HTTP-based and CORBA-based trace viewing clients to share much of the same code and also allowed a focus on scalability and download speed for the server design.

Table 1. Summary of the different programs

Original Applet	The first applet written. Uses HTTP to transfer a single uncompressed trace file. Trace selection is via an HTML link, (one link needed for each trace). JAVA 1.0
Viewer Bean	Base object on which the JAVA 1.1 Viewers were built, follows JAVA Bean conventions.
Local viewer	Direct file system (command line) load of compressed trace (i.e. not client/server). JAVA 1.1
CORBA Viewer Bean	CORBA wrapper around the Viewer Bean. Needs a CORBA naming service object reference to find the trace servers
CORBA Applet/Application Client	IIOP download of traces: either compressed or uncompressed. Trace and database selection from a CORBA trace server.
CORBA Server	Hides trace storage implementation. Offers access to multiple trace sets. Requires JAVA 1.1 and a CORBA 2.0 ORB

The move from the original applet to the client/server design offers many benefits, including an object abstraction; the use of separate named trace stores, each with its own description; a choice of compressed or uncompressed traces; and, most importantly, the opportunity to hide implementation details, such as where and how a particular trace is stored, behind a common database interface (see Figure 2). The separation of client and server, communicating through an agreed interface, is the cornerstone of CORBA distributed software design: it allows concurrent use of different languages and operating systems while still letting both clients and servers improve their implementations and add new features independently of each other. If the original specification is eventually found to be restrictive, new interfaces can be written and implemented whilst still supporting the old (unlike database schemas). IDL also allows inheritance, so simple IDL specifications like that in Figure 2 can be extended to create more complex derived interfaces.

Downloading, parsing and displaying a trace can take less than two seconds (from genome.wustl.edu in the USA to ebi.ac.uk in England) but may take more than five times longer when the Internet is congested (data not shown). Extra time is needed for an initial transfer of a JAVA ORB (if one is needed by the client). ABI format files are the largest and take the longest time to arrive. SCF format trace files (either version) are already many times smaller than the original ABI format trace, and Version 3 SCF chromatograms can be compressed by gzip to less than 7% of the original file size. The SCF Version 3 format was specifically designed to be easily compressed (see URL: http://www.mrc-lmb.cam.ac.uk/pubseq/manual/formats_2.html).
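One reason a chromatogram format can be easy to compress is that smooth sample values can be stored as differences between neighbouring samples, which are small and repetitive. The following sketch illustrates that idea only. It assumes second-order differences applied to synthetic 16-bit samples; it is not the SCF library code, and the class name DeltaGzipSketch, its methods and the test signal are all hypothetical. The format description at the URL above remains the authoritative account of what SCF Version 3 actually stores.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

class DeltaGzipSketch {

    // Synthetic smooth "trace": two sine waves (illustrative data, not a real chromatogram).
    static short[] syntheticTrace(int n) {
        short[] s = new short[n];
        for (int i = 0; i < n; i++) {
            s[i] = (short) Math.round(1000 + 600 * Math.sin(i / 5.0) + 300 * Math.sin(i / 13.7));
        }
        return s;
    }

    // Two passes of "subtract the previous value" give second-order differences.
    static short[] deltaDelta(short[] in) {
        short[] out = in.clone();
        for (int pass = 0; pass < 2; pass++) {
            short prev = 0;
            for (int i = 0; i < out.length; i++) {
                short cur = out[i];
                out[i] = (short) (cur - prev);
                prev = cur;
            }
        }
        return out;
    }

    // Inverse transform: two passes of a running sum restore the original samples.
    static short[] undoDeltaDelta(short[] in) {
        short[] out = in.clone();
        for (int pass = 0; pass < 2; pass++) {
            short sum = 0;
            for (int i = 0; i < out.length; i++) {
                sum = (short) (sum + out[i]);
                out[i] = sum;
            }
        }
        return out;
    }

    // Size in bytes of the samples after big-endian serialisation and gzip compression.
    static int gzippedSize(short[] samples) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (short v : samples) {
                gz.write((v >> 8) & 0xFF);
                gz.write(v & 0xFF);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        short[] trace = syntheticTrace(4000);
        System.out.println("gzip of raw samples:        " + gzippedSize(trace) + " bytes");
        System.out.println("gzip of second differences: " + gzippedSize(deltaDelta(trace)) + " bytes");
    }
}
```

The transform is lossless, so a client can restore the original samples with undoDeltaDelta after decompression. The sizes printed depend on the synthetic signal and say nothing about real trace files.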

The software can be downloaded either as the compiled class Jar files referenced from within the applet tags of any example applet page (use "view page source" in Netscape) or as JAVA and IDL source Jar files as specified above. As an example requiring no JAVA or IDL compilation, nor a local ORB, one could use the local trace viewer to display the gzipped trace file mr32b07.r1.gz in the current directory with the command "java embl.ebi.trace.TraceView mr32b07.r1.gz" once the jar file containing all the chromatogram viewer classes has been downloaded from an applet page and added to the user's local CLASSPATH environment variable. This jar file is typically called "CorbaChromatogramApplet.jar" and includes the "TraceView.class" file and all its supporting classes, so setting the environment for a UNIX csh session would require some version of a command such as "setenv CLASSPATH /home/myclasses/CorbaChromatogramApplet.jar".

Discussion

There are currently a few problems deploying CORBA based applications over the Internet. These problems include old firewalls blocking the IIOP protocol, the need to download ORB classes to clients due to the lack of a guaranteed local ORB, and a lack of support for multiple applet signing, which would allow applets to follow object references to objects on computers other than the original applet's host. Solutions to all these problems already exist, and the irritation they cause should in any case decrease with the release of JDK1.2 from Javasoft (URL:http://www.javasoft.com/) with its high performance class libraries and built-in JAVA ORB. When all Operating Systems (OS) support this rich environment, which includes OMG CORBA support, distributed computing may move further out of the browser and directly into more of a user's normal molecular biology application set.

CORBA may appear to be overkill for this simple interface specification (relative to sockets for example) but as more biological software components are written to CORBA standards, any extra individual server installation effort becomes reduced. The EMBL outstation European Bioinformatics Institute (EBI) is working towards standards for such components along with other members of the OMG's Life Science Research (LSR) Domain Task Force (DTF) (URL: http://lsr.ebi.ac.uk/). JAVA RMI would have been an interesting CORBA alternative but was not investigated due to the lack of relevant biological standards efforts, frameworks, language independence, services and local support.

Future options

The trace viewer is limited by its isolation: only when more CORBA servers are developed to support applications such as EST clustering and sequence assembly will the synergies of CORBA wrapped data become obvious. As soon as practicable, the CORBA trace server will move to the new CORBA 3 standard, which supports server code that is fully portable between different vendors' ORBs. The client should benefit from extra interactive features such as quality value display and editing, external trace view positioning interfaces and multiple trace views.

Methods

All the software is written in JAVA and compiled using Sun Microsystems/Javasoft's Javac JAVA compilers (http://www.javasoft.com/). The simplest applet (also the first written applet) complies with the JAVA 1.0 standard but the remainder of the code requires JAVA 1.1 class libraries. The IDL interface specifications were compiled by Object Oriented Concepts' (URL http://www.ooc.com/) ORBacus IDL to JAVA compiler. Many ORBs and IDL compilers are available free (URL:http://industry.ebi.ac.uk/~corba/) as are Sun's JAVA compilers. Documentation is distributed throughout the code in Javadoc comments.

Fig.2. The DNA trace database IDL interface specification

module embl{
module ebi{
module trace{

        typedef sequence <octet> fileFlow;      // File Bytes
        typedef sequence <octet> qualities;     // Unsigned 8-bit quality "AV" values

        exception InvalidID { string reason;};

        interface TraceStore {
                // A human readable description of the contents of this database
                readonly attribute string description;
                boolean exists (in string ID);

                // Return basecall accuracy estimates if not inside trace
                qualities getQualities (in string ID)
                        raises(InvalidID);

                // Return the ASCII representation of the nucleotide sequence
                string getBases (in string ID)
                        raises(InvalidID);

                // The trace object is sent as a monolithic block, either compressed or not
                fileFlow getGZIPFile (in string ID)
                        raises(InvalidID);
        
                // The server will normally store compressed so better to use getGZIPFile
                fileFlow getFile (in string ID)
                        raises(InvalidID);

                };
        };
};
};
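For readers more familiar with JAVA than IDL, part of the interface in Fig. 2 can be pictured as the hand-written JAVA below. This is a sketch, not the output of an IDL compiler: it avoids all ORB classes so that it runs without CORBA, it covers only a subset of the operations, and the names TraceStoreSketch, InMemoryTraceStore, TraceStoreDemo and the ID "demo.s1" are hypothetical. The in-memory maps stand in for a real trace archive. The sequence of octets becomes a byte array, the read-only attribute becomes an accessor method, and InvalidID becomes a checked exception carrying its reason string.

```java
import java.util.HashMap;
import java.util.Map;

class TraceStoreDemo {
    public static void main(String[] args) throws InvalidID {
        InMemoryTraceStore store = new InMemoryTraceStore("Demo traces (hypothetical)");
        store.put("demo.s1", "ACGTN", new byte[] {1, 2, 3});
        System.out.println(store.description());
        System.out.println(store.exists("demo.s1") + " " + store.getBases("demo.s1"));
        try {
            store.getFile("missing");
        } catch (InvalidID e) {
            System.out.println("InvalidID: " + e.reason);
        }
    }
}

// Mirrors "exception InvalidID { string reason; }".
class InvalidID extends Exception {
    final String reason;

    InvalidID(String reason) {
        super(reason);
        this.reason = reason;
    }
}

// JAVA view of a subset of the TraceStore operations (hypothetical name).
interface TraceStoreSketch {
    String description();                          // readonly attribute string description
    boolean exists(String id);                     // boolean exists (in string ID)
    String getBases(String id) throws InvalidID;   // string getBases (in string ID)
    byte[] getFile(String id) throws InvalidID;    // fileFlow getFile (in string ID)
}

// Stand-in implementation; a real server would read from files or a database.
class InMemoryTraceStore implements TraceStoreSketch {
    private final String description;
    private final Map<String, String> bases = new HashMap<>();
    private final Map<String, byte[]> files = new HashMap<>();

    InMemoryTraceStore(String description) {
        this.description = description;
    }

    void put(String id, String baseCalls, byte[] fileBytes) {
        bases.put(id, baseCalls);
        files.put(id, fileBytes.clone());
    }

    public String description() {
        return description;
    }

    public boolean exists(String id) {
        return files.containsKey(id);
    }

    public String getBases(String id) throws InvalidID {
        requireKnown(id);
        return bases.get(id);
    }

    public byte[] getFile(String id) throws InvalidID {
        requireKnown(id);
        return files.get(id).clone();
    }

    private void requireKnown(String id) throws InvalidID {
        if (!exists(id)) {
            throw new InvalidID("No trace with ID " + id);
        }
    }
}
```

Because clients depend only on the interface, the storage behind it can change without any change to client code, which is the property the IDL specification is intended to guarantee.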

Implementation

The three trace formats are parsed by subclasses of an abstract Chromatogram class. The Chromatogram class visualisation code is in a separate ChromatogramCanvas class to keep display and user-interactivity methods separate from the basic chromatogram object model. The client canvas uses double buffering to reduce flicker when scrolling. The chromatogram display can be switched to display ASCII base calls, or the ABI sequencing machine's comments field.
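The arrangement of one abstract class with a subclass per format can be sketched as follows. The sketch is not the distributed Chromatogram class: the names ChromatogramSketch, AbiSketch, ScfV2Sketch and ScfV3Sketch are hypothetical, no parsing is performed, and format selection by leading "magic" bytes is an assumption about how a loader might choose a subclass. It assumes that SCF files begin with the four characters ".scf" and carry a four-character version string at byte offset 36, and that ABI files begin with "ABIF"; files that wrap the trace in an additional header would need extra handling.

```java
import java.nio.charset.StandardCharsets;

abstract class ChromatogramSketch {

    // Each subclass would also implement the format-specific parsing (omitted here).
    abstract String formatName();

    // Choose a subclass from the first bytes of the file (assumed magic numbers).
    static ChromatogramSketch forBytes(byte[] data) {
        if (startsWith(data, "ABIF")) {
            return new AbiSketch();
        }
        if (startsWith(data, ".scf") && data.length >= 40) {
            // Assumed: version text such as "3.00" sits at offset 36 of the SCF header.
            return data[36] >= '3' ? new ScfV3Sketch() : new ScfV2Sketch();
        }
        throw new IllegalArgumentException("Unrecognised trace format");
    }

    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    // Builds a 128-byte stand-in SCF header containing only the magic and version fields.
    static byte[] fakeScfHeader(String version) {
        byte[] header = new byte[128];
        System.arraycopy(".scf".getBytes(StandardCharsets.US_ASCII), 0, header, 0, 4);
        System.arraycopy(version.getBytes(StandardCharsets.US_ASCII), 0, header, 36, 4);
        return header;
    }

    public static void main(String[] args) {
        System.out.println(forBytes(fakeScfHeader("3.00")).formatName());
        System.out.println(forBytes(fakeScfHeader("2.00")).formatName());
        System.out.println(forBytes("ABIF".getBytes(StandardCharsets.US_ASCII)).formatName());
    }
}

class AbiSketch extends ChromatogramSketch {
    String formatName() { return "ABI"; }
}

class ScfV2Sketch extends ChromatogramSketch {
    String formatName() { return "SCF V2"; }
}

class ScfV3Sketch extends ChromatogramSketch {
    String formatName() { return "SCF V3"; }
}
```

Keeping the factory in the abstract class means display code only ever sees the common type, which matches the separation between the object model and the canvas described above.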

The client/server CORBA system wraps the client classes inside a CORBA adapter class. This adapter translates GUI-generated trace load requests into CORBA method calls on a particular trace database server implementation located via a CORBA naming service. The trace file parsing is easily done on the client because CPU cycles are plentiful and the JAVA code transfer overhead is small (approximately 25% of the size of the smallest compressed trace). To optimize scalability and speed, the server can store and transfer traces as gzipped files, which are easily handled by the java.util.zip package in JAVA 1.1.
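Handling a gzipped trace on the client with java.util.zip can be sketched as below. This is a minimal illustration rather than the distributed client code: the class name GzipTraceSketch and its methods are hypothetical, and a short byte string stands in for a real trace file such as the array returned by getGZIPFile.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipTraceSketch {

    // Server side: compress the bytes of a trace file before storage or transfer.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Client side: expand the received block back into the original file bytes.
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "stand-in bytes for a trace file".getBytes(StandardCharsets.US_ASCII);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.US_ASCII));
    }
}
```

The sketch uses try-with-resources for brevity, which needs a newer JAVA than 1.1; the GZIPInputStream and GZIPOutputStream classes themselves are the ones referred to in the text.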

The CORBA trace server has all the implementation-specific methods for loading a trace (from a particular directory hierarchy or database) in a single class which can be over-ridden. The server configuration details including database names and descriptions are parsed from a simple text file which is read once when the server starts up. Multiple servers in different locations can register with a common naming service.
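A start-up configuration reader of the kind described can be sketched as follows. The file layout is an assumption made for illustration (one trace store per line as a name, a tab, and a description, with "#" starting a comment line); the layout used by the distributed server is not specified here, and the class name TraceServerConfigSketch and the store name "mouse_est" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TraceServerConfigSketch {

    // Parse "name<TAB>description" lines into an ordered map of store name to description.
    static Map<String, String> parse(String text) {
        Map<String, String> stores = new LinkedHashMap<>();
        for (String line : text.split("\r?\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int tab = trimmed.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("Expected name<TAB>description: " + line);
            }
            stores.put(trimmed.substring(0, tab).trim(), trimmed.substring(tab + 1).trim());
        }
        return stores;
    }

    public static void main(String[] args) {
        String config = "# hypothetical trace server configuration\n"
                + "mouse_est\tWashington University mouse EST traces\n";
        for (Map.Entry<String, String> e : parse(config).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each parsed description could then be served as the description attribute of the corresponding TraceStore object, and each name used when registering that object with the naming service.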

The original applet, being written to the older JAVA standard and using an ordinary http daemon as its download server, is well suited both to general Internet deployment, where browser versions may be out of date, and to small sequencing centres, where there are few traces for display and no local programming expertise. The implementation of JAVA 1.1 in Netscape Communicator 4.5 supports all the code described.

Acknowledgments

The authors are grateful to Tom Flores for helping to start the CORBA element of this project. Rodger Staden's group, especially James Bonfield, helped with advice and support. This work was funded by EU grant BIO 4 CT 960346.

References

Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., and Ouellette, B.F.F., 1998, GenBank, Nucleic Acids Res., 26: 1-7
Bonfield, J.K., Smith, K.F., and Staden R., 1995, A new DNA sequence assembly program. Nucleic Acids Res., 24: 4992-4999
Dear, S and Staden, R., 1992, A standard file format for data from DNA sequencing instruments, DNA Sequence 3: 107-110
Ewing, B. and Green, P., 1998, Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Res., 8: 186-194
Gleeson, T. and Hillier, L., 1991, A trace display and editing program for data from fluorescence based sequencing machines, Nucleic Acids Res., 19: 6481-6483
Gordon, D., Abajian, C., and Green, P., 1998, Consed: A Graphical Tool for Sequence Finishing. Genome Res., 8: 195-202
Hillier, L., Lennon, G., Becker, M., Bonaldo, M.F., Chiapelli, B., Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., Hawkins, M., Hultman, M., Kucaba, T., Lacy, M., Le, M., Le, N., Mardis, E., Moore, B., Morris, M., Parsons, J., Prange, C., Rifkin, L., Rohlfing, T., Schellenberg, K., Bento Soares, B., Tan, F., Thierry-Mieg, J., Trevaskis, E., Underwood, K., Wohldman, P., Waterston, R., Wilson, R., and Marra, M., 1996, Generation and Analysis of 280,000 Human Expressed Sequence Tags. Genome Res., 6: 807-828
Hunkapiller, T., Kaiser, R.J., Koop, B.F., and Hood, L., 1991, Large-Scale and Automated DNA Sequence Determination, Science, 254: 59-67
Lijnzaad, P., Helgesen, C., and Rodriguez-Tomé, P., 1998, The Radiation Hybrid Database, Nucleic Acids Res., 26: 102-105
Hu, J., Mungall, C., Nicholson, D., and Archibald, A.L., 1998, Design and implementation of a CORBA-based genome mapping system prototype, Bioinformatics, 14: 112-120
Mott, R., 1998, Trace alignment and some of its applications, Bioinformatics, 14: 92-97
Orfali, R., and Harkey, D., 1997, Client/Server Programming with JAVA and CORBA. John Wiley & Sons, New York, N.Y. 10158-0012.
Orfali, R., and Harkey, D., 1998, Client/Server Programming with JAVA and CORBA. 2nd Edition, John Wiley & Sons, New York, N.Y. 10158-0012.
Stoesser, G., Moseley, M.A., Sleep, J., McGowran, M., Gracia-Pastor, M., and Sterk, P., 1998, The EMBL Nucleotide Sequence Database, Nucleic Acids Res., 26: 8-15