ACEDB User Group Newsletter - June 2003

If you want to have this newsletter mailed to you or you want to make comments/suggestions about the format/content then send an email to acedb@sanger.ac.uk.


The June release includes the first extension to the query language for some time; a new C programmer's interface, Ace-C, reminiscent of AcePerl but designed so that the same subroutines can be used in remote clients, as standalone executables, or embedded in tace/xace; an enhancement to the xace/xremote interface; tips for editing the models file; bug fixes for gene making and acediff; and a speed fix for the server.


General News

Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov reports:

Very good news: Jean mieg@ncbi.nlm.nih.gov and Richard rd@sanger.ac.uk have decided that they will from now on try to contribute regularly to this newsletter, and encourage developers to use it as a forum to discuss planned or desirable enhancements of Acedb.

Very sad news: Mark Sienkiewicz, after one year of excellent work on acedb at NCBI, has left. In this short time, in collaboration with Jean, he has made several lasting contributions to acedb. He wrote the support for arbitrary genetic codes, recursively hooked to the closest parent sequence. This code is compatible with the set of genetic codes recognized by NCBI, and is therefore usable, modulo a simple declaration, for all organisms and organelles. It is currently used in WormBase to translate the mitochondrial genes. Mark then worked on tricky kernel issues affecting the very large acedb database that Jean uses at NCBI to support the annotation of the complete human genome (4 million sequences, 16 million objects). In particular, he fixed a rare disk runaway and accelerated the client/server TCP protocol. But his most important achievement is his contribution to the design and implementation of the new Ace-C programmer's interface described below. "He went over to the dark side, but let us hope for the return of the Jedi".


New Features

Expanding Keysets in the ACEDB query language

(This article is courtesy of Dan Lawson dl1@sanger.ac.uk (who originally requested this new feature) and Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov (who coded the new feature))

ACEDB handles sets of objects called Keysets (you have all seen the xace window termed 'Main KeySet' or the tace line '// 13 Active Objects'). We use Keysets all the time during an ACEDB session, as transient lists of objects of interest, and it is possible to save them for later retrieval/use, using the 'save as' button in the 'Select/Modif' menu of the keyset window. They then belong to the "Keyset" class and can be retrieved in xace from the main window, or using the query: find keyset toto.

The problem was how to get at the objects contained in the keyset. In graphic mode (xace), you click on the Keyset object and it expands in a new keyset window, which you can query, intersect, dump etc. In text mode (tace), until now you could only get inside the keyset using the tace command 'follow'. A major limitation was that this command is not part of the query language, so it could not be used in complex queries or from the WormBase interface.

The new operator, 'expand', fixes this problem in the way you expect. Let us consider a biological example. In each release of WormBase, there is a canned keyset called CDS_with_RNAi. You can now say, from the command line:

acedb> query find Keyset CDS_with_RNAi; expand; where From_laboratory = HX;

// Found 8425 objects
// 8425 Active Objects
acedb>

This is especially useful when you use the WormBase single line query search box:

find Keyset CDS_with_RNAi; expand; From_laboratory = HX;

(Note that you do not need to specify the leading "query" keyword when using the WormBase query dialog.)

This query returns those members of the CDS_with_RNAi keyset which were predicted by the Sanger Institute.

In addition, since the expand operator is now part of the standard acedb query language, it can be used in more complex queries as follows:

Expand: query language operator.
Expand takes no parameter.
It opens the keysets present in the active set and returns as the new 
active set the union of all the objects that they contain.

examples: (at the tace prompt)

acedb> query find author; expand
returns an empty set, because authors are not keysets, but plain objects

acedb> query find keyset CDS_with_RNAi; expand
returns as the active keyset all objects contained in CDS_with_RNAi

acedb> query find keyset C* ; expand ; IS > b10 AND IS <= b100
returns all members of all keysets called C*; then filters those on 
their names. Recall that, in acedb, b8 < b10 < b20 < b22a < b100.
So you will get objects from b11* to b100, possibly including b22a.

acedb> query find keyset ; COUNT { expand } > 12 ; expand
returns a keyset of all keysets containing more than 12 objects, then
expands just those
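
The ordering recalled above (b8 < b10 < b20 < b22a < b100) is what a "natural" comparison gives, where runs of digits are compared as numbers rather than character by character. Here is a minimal sketch in C of such a comparison. The name natural_cmp is hypothetical, the case-insensitive treatment of letters is an assumption, and acedb's own comparison function may differ in its details:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of acedb: compares two names so that
 * embedded runs of digits are ordered by numeric value.
 * Returns <0, 0 or >0, like strcmp. */
int natural_cmp (const char *a, const char *b)
{
  assert (a != NULL && b != NULL) ;

  while (*a && *b)
    {
      if (isdigit ((unsigned char) *a) && isdigit ((unsigned char) *b))
        {
          size_t la = 0, lb = 0, i ;

          /* skip leading zeros so that they do not affect the value */
          while (*a == '0') a++ ;
          while (*b == '0') b++ ;

          /* measure the length of each remaining digit run */
          while (isdigit ((unsigned char) a[la])) la++ ;
          while (isdigit ((unsigned char) b[lb])) lb++ ;

          /* without leading zeros, a longer run is a larger number */
          if (la != lb)
            return la < lb ? -1 : 1 ;

          /* same length: the first differing digit decides */
          for (i = 0 ; i < la ; i++)
            if (a[i] != b[i])
              return a[i] < b[i] ? -1 : 1 ;

          a += la ;
          b += lb ;
        }
      else
        {
          /* assumption: letters are compared without regard to case */
          int ca = tolower ((unsigned char) *a) ;
          int cb = tolower ((unsigned char) *b) ;
          if (ca != cb)
            return ca < cb ? -1 : 1 ;
          a++ ;
          b++ ;
        }
    }

  /* one name is a prefix of the other: the shorter one sorts first */
  if (*a) return 1 ;
  if (*b) return -1 ;
  return 0 ;
}
```

Under this comparison b20 and b22a both fall between b10 and b100, while b8 sorts before b10, which matches the behaviour of the IS filter described in the example.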

Getting the xace main window id for use with xremote

To interact with a running xace program via xremote you need to know the window id of the xace main window. This is now provided along with other useful information in the "readlock" file for the running xace process.

Readlock files are held in the database subdirectory "<db>/database/readlocks" and have a file name of the format:

      <session>.<host>.<pid>

e.g.      5.grifbo.457076

Typical contents of a readlock are:

Readlock file to prevent destruction of sessions still in use by other processes
Created: 2003-07-02_13:45:46
User: edgrif
Program: xace
Version: ACEDB 4_9t
Linkdate: compiled on: Jul  1 2003 14:00:08
WindowID: 0x6c00028

Note the window id as the final entry.
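
A script or helper program can pick the id out of the readlock file mechanically. The following is a minimal sketch in C, assuming only what the sample above shows: one "Key: value" entry per line and a hexadecimal value on the "WindowID:" line. The function name readlock_window_id is hypothetical and not part of acedb:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of acedb: scans a readlock file that is
 * already open for reading, looks for the "WindowID:" line and stores
 * the id found there.
 * Returns 0 on success, -1 if no usable WindowID line was found. */
int readlock_window_id (FILE *fp, unsigned long *id)
{
  static const char key[] = "WindowID:" ;
  char line[512] ;

  assert (fp != NULL && id != NULL) ;

  while (fgets (line, sizeof line, fp))
    {
      if (strncmp (line, key, sizeof key - 1) == 0)
        {
          const char *value = line + sizeof key - 1 ;
          char *end ;
          /* base 16 accepts leading blanks and an optional "0x" prefix */
          unsigned long v = strtoul (value, &end, 16) ;

          if (end == value)
            return -1 ;          /* nothing numeric after the key */
          *id = v ;
          return 0 ;
        }
    }
  return -1 ;                    /* no WindowID line in this file */
}
```

Opening the file is left to the caller, since the readlock name depends on the session, host and process id as described above.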

Ace-C: a new C interface to acedb

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

Mark Sienkiewicz and Jean Thierry-Mieg designed a new C interface to acedb. It is analogous to AcePerl, but programmed as a C library. Like Janus, it has two faces: the same Ace-C code can run as a subroutine in tace, or as a remote client talking to the acedb server.

This is in fact the seventh interface developed for acedb, but we hope that, together with AcePerl, Ace-C will last. The problem at stake is how to use acedb to support a Web site. We first had the Moulon server of the enigmatic Guy Decoux, which was a clever hack rather than a programmer's interface. Then we had the -p style of Doug Bigwood et al., which exported fully fledged Perl objects but was hard to use. Then Lincoln and Jean developed the -j style, first to support Jade, a universal Java interface to acedb which died as a victim of the Java war, and then AcePerl, the clear success we are all familiar with. In the meantime another Java interface, javace, was developed, but it is not universal and exports only a particular acedb schema into a particular biojava schema. There were also two C interfaces: acelib, which was designed collectively at the Cornell acedb meeting but was incomplete and clumsy, and a second one designed at NCBI, which was too complex and never released. By the way, we would like to remove acelib, so if anyone is using it (which I doubt), please let us know and we can help port the code to the new Ace-C.

Why so many attempts? Well, clearly it is a hard problem, but each time we benefit from the previous tries. Why not just use AcePerl? Because AcePerl modules can only be used as slow remote clients, and cannot be embedded as fast subroutines to develop new tools inside acedb. In fact AcePerl would benefit from being rewritten on top of the new Ace-C, which is better optimized than the current AcePerl communication layer and offers new functionality.

The really nice thing is that the Ace-C programmer's interface is defined in a single include file, wac/ac.h, and you may choose to link your program either with the client Ace-C lib, called libaccl.a, or with the standalone Ace-C lib, called libacs.a.

In the client case, data are exchanged with the server using the new 'show -C' mode, born from ideas discussed years ago with Richard and Doug Bigwood. In the standalone case, the Ace-C function calls are just wrappers for standard acedb kernel calls, which means that Ace-C runs with negligible overhead, and can run standalone or as a subroutine embedded in tace/xace. The interface is powerful, streamlined, unambiguous and well defined, and should be easy to learn and use.

Example (extracted from wacext/makefile.acc):

kscount: kscount.o  $(LINK_ACC)
	$(LINKER) -o $@ kscount.o $(LINK_ACC) $(LIBS)

skscount: kscount.o  $(LINK_ACS)
	$(LINKER) -o $@ kscount.o $(LINK_ACS) $(LIBS)

kscount s:host:12345:anonymous   # runs as a client
skscount $ACEDB                  # runs standalone

Ace-C is now used in production underneath the NCBI AceView website, http://www.humangenes.org. All the source code has been put in the CVS, and there is a built-in regression suite. The library is well documented in wac/ac.h, although there is no user manual. However, the rarely used functions may still need some tuning and we would appreciate feedback.

If you are interested, please use the acedb daily build and mail mieg@ncbi.nlm.nih.gov.

Named Keysets

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

As part of the support for Ace-C, we introduced 3 new commands to the acedb command language, available from tace and aceclient.

kstore name  : store active keyset in internal tmp space as name
kget name  : copy named set as active keyset
kclear name  : clear named set

The name space and all the named keysets are per client and are released when the client quits.

Example: Build a Venn diagram

query ......... // a very slow query
kstore L1
query ......... // a very slow query
kstore L2
....
kget L1
spush
kget L2
sand
spop
count  // counts the objects in the intersection

I remember that Lincoln had for years been asking for this functionality.


Articles

Editing wspec/models.wrm

A couple of times recently users have sent emails with questions about the models file and how it should be edited. Here are tips/guidelines/rules.

The spatial layout of the models.wrm file is crucial as it describes the structure of objects in the database. Here is a crude example:

 Class in models.wrm               Object in database

?Name   First_name Text        Ed---->First_name--->"Edward"
        Surname Text                      |
                                          V
                                      Surname---->"Griffiths"

If you misalign the class definition as shown below, acedb will complain and refuse to parse the models file:

?Name	First_name Text
Surname Text

This is fine in a way, because you will quickly find out about the error and nothing will be corrupted. Mistakes can, however, be less obvious if you have a deeply nested object and you have used Tabs to space out your class. The accidental deletion of a Tab can change the whole branching structure of your class:

Original class:

?Map    No_cache // Don't cache segs for this map.
        Display Non_graphic  // Prevents a graphic display!
                Title UNIQUE ?Text
                Flipped // Then coordinates go upwards
                Unit UNIQUE Text
                        // e.g. kb, centiMorgan, MegaParsec
                Centre UNIQUE Float UNIQUE Float
                        // default centre, width - else 0, 10
                Extent UNIQUE Float UNIQUE Float
                        // min, max - else min, max gene/locus
                Default_view UNIQUE ?View
                Minimal_view UNIQUE ?View // use this when >1 map displayed
                View ?View                // Columns to display
        Inherits  From_map UNIQUE ?Map  // To locally edit
                  Author Text           // login name of who created it

Class after accidental deletion of Tab preceding "Centre":

?Map    No_cache // Don't cache segs for this map.
        Display Non_graphic  // Prevents a graphic display!
                Title UNIQUE ?Text
                Flipped // Then coordinates go upwards
                Unit UNIQUE Text
                        // e.g. kb, centiMorgan, MegaParsec
        Centre UNIQUE Float UNIQUE Float
                        // default centre, width - else 0, 10
                Extent UNIQUE Float UNIQUE Float
                        // min, max - else min, max gene/locus
                Default_view UNIQUE ?View
                Minimal_view UNIQUE ?View // use this when >1 map displayed
                View ?View                // Columns to display
        Inherits  From_map UNIQUE ?Map  // To locally edit
                  Author Text           // login name of who created it

This one-character deletion changes the branching structure of the object, completely legally, into something quite different from the original. There are now three main branches in the object, defined by the "Display", "Centre" and "Inherits" tags, where there used to be two. While using both spaces and Tabs is completely legal, using Tabs is not a good idea!
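
One way to guard against this is to check a models file for Tab characters before it is read. The following is a minimal sketch in C of such a check. The function name count_tab_lines is hypothetical, and the check is an illustration rather than anything acedb itself provides:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of acedb: reads a text stream, reports
 * on 'report' (if not NULL) every line that contains a Tab character,
 * and returns the number of such lines. Line numbers start at 1. */
int count_tab_lines (FILE *in, FILE *report)
{
  int c, line = 1, tabbed = 0, seen_tab = 0 ;

  assert (in != NULL) ;

  while ((c = getc (in)) != EOF)
    {
      if (c == '\t')
        seen_tab = 1 ;
      else if (c == '\n')
        {
          if (seen_tab)
            {
              tabbed++ ;
              if (report)
                fprintf (report, "line %d contains a Tab\n", line) ;
            }
          seen_tab = 0 ;
          line++ ;
        }
    }

  /* a last line without a trailing newline still counts */
  if (seen_tab)
    {
      tabbed++ ;
      if (report)
        fprintf (report, "line %d contains a Tab\n", line) ;
    }
  return tabbed ;
}
```

A return value of zero means the layout depends on spaces only, so a single deleted character can shift a tag by one column at most.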

If you inadvertently read this wrong models file and start using it, you can fix the mess with tace:

tace $ACEDB << EOF
query find map centre // gets just the 'new' centre
show -a -f toto.ace centre // exports just that tag
edit -D centre // removes the tag in the 'new' objects
read-models // read back the correct models
parse toto.ace  // recover your data in the right place
save
quit
EOF

Here are some tips that will help you to avoid trouble when editing the models.wrm file:


Bugs Fixed

Gene curation bug in FMap

A number of users have reported the error "Failed to update parent sequence object: " when trying to make temporary genes from within the gene finding package for the FMap. This has now been fixed and was a fault in the code that tried to find a parent for the temporary gene.

Blixem not displaying homologies that extend to end of sequence

A trivial (in the coding, not the debugging, sense!) off-by-one error in the coordinate calculations of the code that calls blixem resulted in the loss of all homologies which extend right to the end of the displayed sequence.

acediff bugs

acediff was erroneously interpreting "//" in a text field as the start of an acedb style comment and hence getting the text field wrong. This is now fixed: backslash escapes in comment delimiters are now honoured.

// this is a  comment
/\/ this is not (now) a comment
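
The rule illustrated by these two lines can be sketched in a few lines of C. This is an illustration only, not the acediff source: the function name comment_start is hypothetical, and the sketch assumes that a backslash escapes whatever single character follows it and gives no special treatment to quoted strings:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not acediff's code: returns a pointer to the
 * "//" that starts a comment in 'line', or NULL if the line contains
 * no comment. A backslash escapes the character that follows it, so
 * the sequence /\/ is not a comment delimiter. */
const char *comment_start (const char *line)
{
  const char *p ;

  assert (line != NULL) ;

  for (p = line ; *p ; p++)
    {
      if (*p == '\\')
        {
          if (p[1] == '\0')
            break ;              /* trailing backslash, nothing to escape */
          p++ ;                  /* skip the escaped character */
        }
      else if (p[0] == '/' && p[1] == '/')
        return p ;
    }
  return NULL ;
}
```

With this rule the first example line above is a comment from its first character, while the second has no comment at all.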

acediff had a problem processing files where the same tag was repeated: clumsy pre-processing led it to generate spurious -D lines in the output. This is now fixed.

Printer list in print window

The printer list in the print window used to contain blank lines if the system file /etc/printcap contained incomplete entries. This is now fixed.

makefile shenanigans

Replaced "test" with "/usr/bin/test" since the test builtin in some versions of the shell doesn't understand "-L" -- step forward Solaris!


Developers Corner

Performance and acedb

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

This could be interesting to all programmers. I recently fixed heavy performance problems just by editing my makefile:

  1. I had an apparent memory leak in the acedb server underneath my web site, affecting Solaris on both sparc and Intel. The leak went away when I removed the linker option -lmalloc from my makefile.
  2. On Linux, I considerably accelerated the parsing of files by removing the linker option -lpthread. I was not actually using threads (acedb is NOT thread safe), but the library was listed in my configuration files because of other programs. Linking it turned getc(), which the acedb parser uses to scan input files, from a macro into a very slow function. It is possible that this library is linked implicitly in many C++ implementations, slowing down many of your programs, so beware!
  3. Several months ago, qsort on Solaris was bugged: with a very low but non-zero probability it retrieved a negative record, which very soon afterwards caused a flat crash. I fixed that by replacing the call to qsort in arraySort with a call to msort, a function that I added in freesubs and which is actually faster.

Conclusion: if you have any trouble with performance, do not forget to suspect the standard Unix functions, such as qsort, and the standard libraries, such as -lmalloc and -lpthread. Also remember that I am always interested in hearing about performance problems (mieg@ncbi.nlm.nih.gov).
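
For readers who want to try the same remedy as in item 3, here is a minimal sketch in C of a merge sort that can stand in for qsort. It is not the msort from freesubs: the names merge_sort, merge_sort_rec, CmpFn and cmp_int are hypothetical, and unlike qsort this version returns a status because it allocates scratch memory. Merge sort always runs in O(n log n) comparisons and keeps equal elements in their original order:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*CmpFn) (const void *, const void *) ;

/* Sorts n elements of width 'size' starting at base, using tmp
 * (room for n elements) as scratch space. */
static void merge_sort_rec (char *base, char *tmp, size_t n, size_t size, CmpFn cmp)
{
  size_t half = n / 2, i = 0, j = half, k = 0 ;

  if (n < 2)
    return ;
  merge_sort_rec (base, tmp, half, size, cmp) ;
  merge_sort_rec (base + half * size, tmp, n - half, size, cmp) ;

  /* merge the two sorted halves; taking the left element on ties
   * keeps the sort stable */
  while (i < half && j < n)
    {
      size_t from = cmp (base + i * size, base + j * size) <= 0 ? i++ : j++ ;
      memcpy (tmp + k * size, base + from * size, size) ;
      k++ ;
    }

  /* at most one of the halves still has elements left */
  memcpy (tmp + k * size, base + i * size, (half - i) * size) ;
  k += half - i ;
  memcpy (tmp + k * size, base + j * size, (n - j) * size) ;

  memcpy (base, tmp, n * size) ;
}

/* Hypothetical qsort stand-in: returns 0 on success, -1 if the
 * scratch buffer could not be allocated (the array is then untouched). */
int merge_sort (void *base, size_t n, size_t size, CmpFn cmp)
{
  char *tmp ;

  assert (cmp != NULL && size > 0) ;
  if (n < 2)
    return 0 ;
  tmp = malloc (n * size) ;
  if (tmp == NULL)
    return -1 ;
  merge_sort_rec (base, tmp, n, size, cmp) ;
  free (tmp) ;
  return 0 ;
}

/* Example comparator for arrays of int. */
int cmp_int (const void *a, const void *b)
{
  int x = *(const int *) a, y = *(const int *) b ;
  return (x > y) - (x < y) ;
}
```

A call such as merge_sort (v, n, sizeof v[0], cmp_int) takes the same arguments as qsort, so it can be swapped in at a single call site for testing.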

About catText catBinary

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

Let us re-explain what happened last month. Ever since its introduction, catBinary had a bug: it appended an extra zero at the end of the buffer. This was apparent only if you called catBinary() twice on the same stack, because you then had this extra zero in between the two pages. Nobody ever complained; maybe nobody ever tried. This bug was fixed by Mark Sienkiewicz around May 25.

While fixing the bug, catText was inadvertently modified. This killed the query language, was detected by Aquila the same night, and fixed the following day. So, just to clarify:

Server speed up

(Thanks to Mark Sienkiewicz (formerly of NCBI) for spotting this one.)

Requests/replies can be greatly sped up for the server by setting the TCP_NODELAY flag for the socket. This disables Nagle's algorithm, so TCP no longer holds back small packets until data already sent has been acknowledged. This is particularly relevant for small requests on slow networks (the effect is hardly noticeable on a local high speed network):

Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov reports:

Mark Sienkiewicz greatly sped up the server by setting the TCP_NODELAY flag for the socket. This stops TCP from holding back small packets until earlier ones have been acknowledged. The improvement is particularly important for small requests on slow networks, as when one uses AcePerl over the network across the Atlantic Ocean. The way Mark found the bug is interesting. Because it seemed too hard to retrofit the ace.4_9 TCP protocol (let us call it the s-protocol) into the ace.4_7 code still used in production at NCBI, he wrote, in the context of Ace-C, a new simple-minded TCP protocol (let us call it the a-protocol). Later the Ace-C client was adapted to talk to the ace.4_9 saceserver, and it became immediately apparent that the s-protocol was way slower than the a-protocol, even on the NCBI intranet. After some testing, it all boiled down to raising a single flag, and speeds were equalized.
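
Raising the flag is a single setsockopt call. Here is a minimal sketch in C, assuming a POSIX sockets environment. The function name set_tcp_nodelay is illustrative and this is not the code used in the acedb server:

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustrative helper, not the acedb server code: sets TCP_NODELAY on
 * a TCP socket, which turns off Nagle's algorithm for that socket.
 * Returns 0 on success, -1 on error with errno set by setsockopt. */
int set_tcp_nodelay (int sock)
{
  int on = 1 ;

  assert (sock >= 0) ;
  return setsockopt (sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on) ;
}
```

The option is set per socket, so a server would typically apply it to each connection it accepts, and a client to the socket it connects with.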

Optimisations

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

With data size ever increasing, it becomes necessary to optimize the code once more. To this end, I revamped the acedb CHRONO and started pushing fixes into the CVS.

  1. A glitch in the hashing function used by the dict library was fixed. This does not affect the acedb main classes, which have their own direct hashing. The improvement is visible only if you use a dict with over 100,000 entries.
  2. I modified disk/cache handling with a tremendous effect on very large databases. To parse the 16 million objects of the whole human genome annotation database on an alpha, time went down from 24 to 12 hours.
  3. The cause of the disk runaways has been found by Mark Sienkiewicz, but the fix still needs to be incorporated into the development code.
  4. File scanning optimization is under way, and will be put in the CVS shortly.
  5. Several types of query now use the internal indexing more efficiently.

Chrono

(This article is courtesy of Jean Thierry-Mieg mieg@ncbi.nlm.nih.gov)

Chrono is back and working. It is a very versatile way of profiling acedb. It is programmed as macros, and does not affect the compiled code when turned off.

Usage:
#define CHRONO        /* if undef, all chrono calls disappear */
#include "chrono.h"   /* must come after the define CHRONO */
...
void f (void)
{
  chrono ("f") ;      /* or any message you like */
  .........           /* do something */
  chronoReturn () ;   /* carefully chronoReturn along each path; */
                      /* beware of goto's and early return statements */
}
...
int main (void)
{
  chronoStart () ;
  f () ;              /* the code being profiled */
  chronoStop () ;
  chronoReport () ;
  return 0 ;
}

The nice thing is that you can chrono a whole function or just a few lines of code. Nested chrono calls are allowed.
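
The property that the calls cost nothing when turned off comes from the C preprocessor. The following minimal sketch shows the idea with hypothetical names (PROFILE, Timer, timerEnter, timerLeave, timerReport) and the standard clock() function. It is not the implementation in chrono.h, which may work quite differently:

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define PROFILE   /* remove this line and the timer calls compile to nothing */

typedef struct { const char *name ; long calls ; clock_t total, start ; } Timer ;

#ifdef PROFILE
#define timerEnter(t)  ((t)->calls++, (t)->start = clock ())
#define timerLeave(t)  ((t)->total += clock () - (t)->start)
#else
#define timerEnter(t)  ((void) 0)
#define timerLeave(t)  ((void) 0)
#endif

/* one timer per profiled region, here a whole function */
Timer sumTimer = { "sum_to", 0, 0, 0 } ;

/* Example of an instrumented function: sums the integers 1..n. */
long sum_to (long n)
{
  long i, s = 0 ;

  assert (n >= 0) ;
  timerEnter (&sumTimer) ;
  for (i = 1 ; i <= n ; i++)
    s += i ;
  timerLeave (&sumTimer) ;    /* needed on every path that leaves the region */
  return s ;
}

/* Prints the number of calls and the accumulated CPU time of a timer. */
void timerReport (const Timer *t, FILE *out)
{
  assert (t != NULL && out != NULL) ;
  fprintf (out, "%s: %ld calls, %.2f s\n", t->name, t->calls,
           (double) t->total / CLOCKS_PER_SEC) ;
}
```

Because the disabled macros expand to an empty expression, the instrumented source does not have to change between profiling and production builds.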

To start and stop a chrono session, there are four methods

Example comparing acein and freecard on the same dataset:

// Total time : 4.42 s  system,  66.43 user  level = 0
//           # of calls   System       %          User       %
//   Chrono        0      0.00 s      0 %         0.00 s     0 %
//test_acein       1      0.65 s     14 %        14.50 s    21 %
//test_freecard    1      0.92 s     20 %        13.93 s    20 %


June monthly build now available.

You can pick up the monthly builds from:

Sanger users
~acedb/RELEASE.DEVELOPMENT
External users
http://www.acedb.org/Software/Downloads/monthly.shtml


Next User Group Meeting - D319, 3.00pm, Thursday, 10th July 2003



Ed Griffiths <edgrif@sanger.ac.uk>
Last modified: Tue Jul 8 10:46:24 BST 2003