EMBL-Outstation - The European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; 1Dept. of Computer and Information Science University of Pennsylvania, Philadelphia, PA 19104, USA, and 2Genome Sequencing Center, Washington University School of Medicine, Saint Louis, Missouri 63108, USA.
Corresponding Author:
Jeremy D. Parsons
EMBL-Outstation EBI,
Wellcome Trust Genome Campus,
Hinxton,
Cambridge,
CB10 1SD
UK
Tel. +44 1223 494665
Fax. +44 1223 494468
jparsons@ebi.ac.uk
The software is packaged as a Jar file available from http://www.ebi.ac.uk/~jparsons. Links to working examples of the trace viewers can be found at http://corba.ebi.ac.uk/EST. All the Washington University mouse EST traces are available for browsing at the same URL.
The use of CORBA in a biological context was introduced by Hu et al. (1998) and Lijnzaad et al. (1998), who explain that CORBA can be a good solution to the problem of creating applications for distributed heterogeneous environments. The Internet is the extreme example of both distribution and heterogeneity and is described by Orfali and Harkey (1998) as being host to the "Object Web", where CORBA and JAVA complement each other's abilities to create globally accessible interactive objects. Principal among the benefits that CORBA brings to biological data distribution and interaction are: the Interface Definition Language (IDL) to define interfaces between objects; scalability (including language and operating system independence); state preservation across invocations; and a rich set of 15 object services, e.g. the Naming Service.
A telling example of the special role of sequencing traces as the ultimate reference for a sequencing reaction has emerged from the combination of the recent release of the Phred basecalling program (Ewing and Green, 1998) and the status of the largest section of the public nucleotide databases: the EST sequences. ESTs are unusual in that they are submitted and published in a raw state with limited quality control (Hillier et al., 1996) and without the error detection and correction process intrinsic to normal shotgun assembly. Hillier et al. stressed the need for traces to be available on-line globally and have therefore maintained an ftp site, from which traces can be downloaded, since the beginning of their EST sequencing. Now that Ewing and Green have released Phred with its improved base calling (estimated to make 50% fewer errors than the original ABI basecaller), approximately 250,000 entries in the GenBank (Benson et al., 1998) and EMBL (Stoesser et al., 1998) nucleotide databases may be considered out of date and ripe for replacement, whilst the original chromatograms remain available on-line and ready for re-interpretation at the originating laboratory.
Overall, the Internet is enabling decentralization within all areas of biological data access via the simplicity and low cost of HTML, the code portability of JAVA, and now the global middleware of CORBA. Although DNA sequence traces are collectively large and scattered globally, they remain important and are following the same trends as other types of biological data: originally accessible via ftp, and now by JAVA applet over either HTTP or the OMG's Internet Inter-ORB Protocol (IIOP) as described in this paper.
CORBA | Common Object Request Broker Architecture: A set of software standards and tools to act as middleware helping the creation and interaction of software components. Large programs being designed today may have so many dependencies and interconnections that they could become difficult to maintain. CORBA hides the complexity within each part of a program and simplifies the discovery and integration of groups of components needed to solve specific tasks. CORBA sits between components (hence the term middleware) and provides programming language independence, location independence and a set of commonly used services. |
Dynamic-HTML | A heterogeneous collection of technologies to make HTML pages less static and able to change once downloaded into a client web browser (includes JavaScript and cascading style sheets) |
JAVA | A modern, object-oriented, web-centric compiled programming language that is perhaps uniquely able to run on almost all computers from humble PCs up to mainframes. JAVA is a full, complex language that can be used to write any normal application on any computer, though it may be slower than some alternatives. JAVA was designed with the Internet in mind and has a sophisticated security model allowing users to rapidly download software (applets) and run it locally with very little risk. Programs running locally can be much more responsive than programs running across the Internet. |
JavaScript | A simple interpreted computer language that can run in a WWW browser, where it can generate interactive HTML pages and control the content and checking of HTML objects like forms, applets and cookies. It was originally not directly related to JAVA, but the two share a similar syntax and complement each other's abilities. Furthermore, a typical JavaScript implementation will have limited access to the public internals of JAVA applet classes embedded in any downloaded HTML page. |
Firewall | A firewall is commonly just a computer with two network cards running special software to monitor and control data going between a private intranet and the public Internet. The main purpose is usually to stop hackers damaging internal computers or gaining access to private data. |
IIOP | Internet Inter-ORB Protocol is the standardized way that ORBs from different vendors can talk to each other and so pass messages between clients and servers etc. |
IDL | Interface Definition Language is the specification language that is used to define the links between software components. A server will define in IDL all the objects and services it can offer whilst a client will use the IDL to learn both what a server can provide and how to ask for it. |
ORB | An Object Request Broker is the piece of software that does all the linking between components when a program is actually running. One part of the program might be running in one location, on one kind of computer, and be written in one language whilst the other component might be different in every respect but the ORB can still let them interact without either being aware of any intermediary conversion and navigation process. |
XML | Extensible Mark-up Language is a text-based format for storing and sharing documents and data of any kind, and so extends HTML, the language in which all existing WWW pages are written. HTML documents may contain text, graphics, JAVA applet classes or data for proprietary plug-ins, but XML pages are completely extensible to any kind of content because XML acts as a meta-language allowing the creation of specialised content languages for things not normally displayed in web pages, such as scientific data or database records. XML is a subset of the even more abstract SGML. |
Original Applet | The first applet written. Uses HTTP to transfer a single uncompressed trace file. Trace selection is via an HTML link (one link needed for each trace). JAVA 1.0 |
Viewer Bean | Base object on which the JAVA 1.1 Viewers were built, follows JAVA Bean conventions. |
Local viewer | Direct file system (command line) load of compressed trace (i.e. not client/server). JAVA 1.1 |
CORBA Viewer Bean | CORBA wrapper around the Viewer Bean. Needs a CORBA naming service object reference to find the trace servers |
CORBA Applet/Application Client | IIOP download of traces: either compressed or uncompressed. Trace and database selection from a CORBA trace server. |
CORBA Server | Hides trace storage implementation. Offers access to multiple trace sets. Requires JAVA 1.1 and a CORBA 2.0 ORB |
The move from the original applet to the client/server design offers many benefits including: an object abstraction; the use of separate named trace stores, each with its own description; a choice of compressed or uncompressed traces; and, most importantly, the opportunity to generalize implementation details, such as where and how a particular trace is stored, to present a common database interface (see Figure 2). The separation of client and server communicating through an agreed interface is the cornerstone of CORBA distributed software design, allowing concurrent use of different languages and operating systems while still allowing both clients and servers to improve implementations and add new features independently of each other. If, eventually, the original specification is found to be restrictive, new interfaces can be written and implemented whilst still supporting the old (unlike database schemas). IDL also allows inheritance, so simple IDL specifications like that in Figure 2 can be extended to create more complex derived interfaces.
Downloading, parsing and displaying a trace can take less than two seconds (from genome.wustl.edu in the USA to ebi.ac.uk in England) but may take more than five times longer when the Internet is congested (data not shown). Extra time is needed for an initial transfer of a JAVA ORB (if one is needed by the client). ABI format files are the largest and take the longest time to arrive. SCF format trace files (either version) are already many times smaller than the original ABI format trace, and Version 3 SCF chromatograms can be compressed by gzip to less than 7% of the original file size. The SCF Version 3 format was specifically designed to be easily compressed; see http://www.mrc-lmb.cam.ac.uk/pubseq/manual/formats_2.html
The software can be downloaded either as the compiled class Jar files referenced from within the applet tags of any example applet page (use "view page source" in Netscape) or as JAVA and IDL source Jar files as specified above. As an example requiring no JAVA or IDL compilation, nor a local ORB, one could use the local trace viewer to display the gzipped trace file mr32b07.r1.gz in the current directory with the command "java embl.ebi.trace.TraceView mr32b07.r1.gz" after the jar file containing all the chromatogram viewer classes has been downloaded from an applet page and directly specified in the user's local CLASSPATH environment variable. This jar file is typically called "CorbaChromatogramApplet.jar" and includes the "TraceView.class" file and all its supporting classes, so setting the environment for a UNIX csh session would require some version of a command such as "setenv CLASSPATH /home/myclasses/CorbaChromatogramApplet.jar" (csh's setenv takes the variable name and value separated by a space, with no "=" sign).
CORBA may appear to be overkill for this simple interface specification (relative to sockets, for example), but as more biological software components are written to CORBA standards, the extra installation effort for each individual server is reduced. The EMBL outstation European Bioinformatics Institute (EBI) is working towards standards for such components along with other members of the OMG's Life Science Research (LSR) Domain Task Force (DTF) (URL: http://lsr.ebi.ac.uk/). JAVA RMI would have been an interesting CORBA alternative but was not investigated due to the lack of relevant biological standards efforts, frameworks, language independence, services and local support.
module embl {
  module ebi {
    module trace {
      typedef sequence <octet> fileFlow;   // File Bytes
      typedef sequence <octet> qualities;  // Unsigned 8-bit quality "AV" values
      exception InvalidID { string reason; };
      interface TraceStore {
        // A human readable description of the contents of this database
        readonly attribute string description;
        boolean exists (in string ID);
        // Return basecall accuracy estimates if not inside trace
        qualities getQualities (in string ID) raises(InvalidID);
        // Return the ASCII representation of the nucleotide sequence
        string getBases (in string ID) raises(InvalidID);
        // The trace object is sent as a monolithic block, either compressed or not
        fileFlow getGZIPFile (in string ID) raises(InvalidID);
        // The server will normally store compressed so better to use getGZIPFile
        fileFlow getFile (in string ID) raises(InvalidID);
      };
    };
  };
};
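To make the shape of this interface concrete without requiring an ORB, the following is a minimal sketch in plain modern JAVA. It is a hand-written mirror of part of the IDL, not the code an IDL compiler would generate and not the authors' implementation: the class names TraceStoreSketch and InMemoryTraceStore and the put helper are hypothetical, and getQualities and getGZIPFile are omitted for brevity. It follows the usual IDL-to-JAVA conventions in which sequence <octet> becomes byte[] and a readonly attribute becomes an accessor method of the same name.

```java
import java.util.HashMap;
import java.util.Map;

public class TraceStoreSketch {

    /** Hand-written stand-in for the IDL exception InvalidID { string reason; }. */
    public static class InvalidID extends Exception {
        public final String reason;
        public InvalidID(String reason) {
            super(reason);
            this.reason = reason;
        }
    }

    /** Plain-Java mirror of a subset of the IDL TraceStore interface (assumption: no ORB involved). */
    public interface TraceStore {
        String description();                          // readonly attribute string description
        boolean exists(String id);                     // boolean exists (in string ID)
        String getBases(String id) throws InvalidID;   // string getBases (in string ID)
        byte[] getFile(String id) throws InvalidID;    // fileFlow getFile (in string ID)
    }

    /** Hypothetical in-memory store, used only to exercise the interface. */
    public static class InMemoryTraceStore implements TraceStore {
        private final String description;
        private final Map<String, String> bases = new HashMap<>();
        private final Map<String, byte[]> files = new HashMap<>();

        public InMemoryTraceStore(String description) {
            this.description = description;
        }

        /** Illustrative helper for loading test data; not part of the IDL. */
        public void put(String id, String baseCalls, byte[] fileBytes) {
            bases.put(id, baseCalls);
            files.put(id, fileBytes.clone());
        }

        @Override public String description() { return description; }

        @Override public boolean exists(String id) { return files.containsKey(id); }

        @Override public String getBases(String id) throws InvalidID {
            if (!bases.containsKey(id)) throw new InvalidID("No such trace: " + id);
            return bases.get(id);
        }

        @Override public byte[] getFile(String id) throws InvalidID {
            if (!files.containsKey(id)) throw new InvalidID("No such trace: " + id);
            return files.get(id).clone();
        }
    }

    public static void main(String[] args) {
        InMemoryTraceStore store = new InMemoryTraceStore("Example trace set");
        store.put("mr32b07.r1", "ACGTTGCA", new byte[] {1, 2, 3});
        try {
            System.out.println(store.description());
            System.out.println(store.getBases("mr32b07.r1"));
            store.getBases("unknown");                 // raises InvalidID
        } catch (InvalidID e) {
            System.out.println("InvalidID: " + e.reason);
        }
    }
}
```

A client written against the TraceStore interface alone does not need to know whether the traces come from memory, a directory tree or a database, which is the abstraction the IDL is intended to provide.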
The client/server CORBA system wraps the client classes inside a CORBA adapter class. This adapter translates GUI-generated trace load requests into CORBA method calls on a particular trace database server implementation, which it locates via a CORBA naming service. The trace file parsing is easily done on the client because CPU cycles are plentiful and the JAVA code transfer overhead is small (approximately 25% of the size of the smallest compressed trace). To optimize scalability and speed, the server can store and transfer traces as gzipped files, which are easily handled by the java.util.zip package in JAVA 1.1.
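The gzip handling mentioned above can be sketched with the standard java.util.zip classes. This is an illustration only, not the viewer's actual code: the class name GzipTraceDemo and the placeholder data are assumptions, and the try-with-resources syntax is from later JAVA releases than 1.1, although GZIPInputStream and GZIPOutputStream themselves date from JAVA 1.1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipTraceDemo {

    /** Compress raw trace bytes, as a server might do before storing them. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    /** Decompress a gzipped block, as a client would after receiving it. */
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data standing in for a trace file; real traces compress less well.
        byte[] raw = new byte[10000];
        byte[] packed = gzip(raw);
        byte[] unpacked = gunzip(packed);
        System.out.println(raw.length + " bytes -> " + packed.length
                + " bytes gzipped -> " + unpacked.length + " bytes restored");
    }
}
```

Because the block is decompressed entirely on the client, the server only has to hand over the stored bytes unchanged, which keeps its per-request work small.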
The CORBA trace server has all the implementation-specific methods for loading a trace (from a particular directory hierarchy or database) in a single class which can be overridden. The server configuration details, including database names and descriptions, are parsed from a simple text file which is read once when the server starts up. Multiple servers in different locations can register with a common naming service.
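The idea of confining storage details to one overridable class might look like the sketch below. All names here (TraceLoaderSketch, TraceLoader, PrefixedLoader, locate) and both directory layouts are hypothetical and chosen only for illustration; the actual server classes and its configuration file format are not reproduced here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TraceLoaderSketch {

    /** Base loader: assumes every trace sits directly under one root directory as ID.gz. */
    public static class TraceLoader {
        protected final Path root;

        public TraceLoader(Path root) {
            this.root = root;
        }

        /** The storage-specific step; subclasses override this for other layouts. */
        protected Path locate(String id) {
            return root.resolve(id + ".gz");
        }

        public boolean exists(String id) {
            return Files.isRegularFile(locate(id));
        }

        /** Return the stored (already gzipped) bytes unchanged. */
        public byte[] load(String id) throws IOException {
            return Files.readAllBytes(locate(id));
        }
    }

    /** Variant for a hierarchy keyed on the first two characters of the trace ID. */
    public static class PrefixedLoader extends TraceLoader {
        public PrefixedLoader(Path root) {
            super(root);
        }

        @Override
        protected Path locate(String id) {
            String prefix = id.length() >= 2 ? id.substring(0, 2) : id;
            return root.resolve(prefix).resolve(id + ".gz");
        }
    }

    public static void main(String[] args) {
        // Hypothetical root directory; the rest of a server would only call exists() and load().
        TraceLoader loader = new PrefixedLoader(Paths.get("traces"));
        System.out.println("mr32b07.r1 present: " + loader.exists("mr32b07.r1"));
    }
}
```

Only locate differs between the two loaders, so a site with a different storage scheme would need to replace a single method while the rest of the server stays the same.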
The original applet, being written to the older JAVA standard and using an ordinary HTTP daemon as its download server, is well suited to general Internet deployment where browser versions may be out of date, and to small sequencing centres where there are few traces for display and no local programming expertise. The implementation of JAVA 1.1 in Netscape Communicator 4.5 supports all the code described.