If you want to have this newsletter mailed to you or you want to make comments/suggestions about the format/content then send an email to acedb@sanger.ac.uk.
This month sees more work on the new "Genetic_code" class with support added for the rare Selenocysteine, more blixem/dotter work, an extension to the Query language, more work on the tree display find facility, LongText editing, and the usual bug fixes.
Last month saw the addition of the new Genetic_code class to support specifying alternative genetic codes to be used in translating DNA to peptides. This month support is added for a rare variant of the genetic code found in C. elegans and perhaps other organisms. For at least one transcript in C. elegans the triplet TGA codes for the amino acid Selenocysteine instead of playing its usual role as a stop codon. To ensure that acedb will translate this transcript correctly (i.e. will translate TGA as a "U" and not a "*") you need to set the new Selenocysteine tag in the Genetic_code class:
?Genetic_code Other_name ?Text
              Translation UNIQUE Text
              Start UNIQUE Text
              Base1 UNIQUE Text
              Base2 UNIQUE Text
              Base3 UNIQUE Text
              Selenocysteine Remark Text // Use the Remark to document your usage of this tag.
It is not yet known whether all such transcripts also use TGA as a stop codon, but acedb will, as usual, remove any trailing stop codon (including TGA) from the translation.
Note that the way the existing Genetic_code class is handled by acedb allows you to specify any number of alternative codes within a single sequence hierarchy, so no additional work by the curator is needed to add these rare variants other than specifying the correct Genetic_code object for that transcript.
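As an illustration only, the translation behaviour described above can be sketched in Python. This is not acedb's code (the real translation lives in peptide.c); the function name `translate`, the `selenocysteine` flag and the deliberately tiny codon table are all hypothetical and exist only for this example.

```python
# Hypothetical, deliberately tiny codon table: just enough for the example.
# A real table would cover all 64 codons.
CODONS = {
    "ATG": "M", "GCT": "A", "AAA": "K",
    "TAA": "*", "TAG": "*", "TGA": "*",
}
STOPS = {"TAA", "TAG", "TGA"}


def translate(dna, selenocysteine=False):
    """Translate DNA codon by codon, dropping one trailing stop codon."""
    usable = len(dna) - len(dna) % 3          # ignore an incomplete last codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    # Remove a trailing stop codon (TGA included) before translating.
    if codons and codons[-1] in STOPS:
        codons.pop()
    peptide = []
    for codon in codons:
        if selenocysteine and codon == "TGA":
            peptide.append("U")               # internal TGA read as Selenocysteine
        else:
            peptide.append(CODONS.get(codon, "X"))  # X = codon not in toy table
    return "".join(peptide)


print(translate("ATGTGAGCTTGA", selenocysteine=True))  # MUA
print(translate("ATGTGAGCTTGA"))                       # M*A
```

With the flag set, the internal TGA becomes "U" while the trailing TGA is still stripped as a stop codon; without it, the internal TGA is reported as "*".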
A number of additions/fixes/features have been added to the blixem/dotter code:
EMBL sequence entries:
Now that we have the basic mechanism in place we will add cut and paste to other text windows. If there are any windows that you would like this added to as a priority, then email acedb@sanger.ac.uk.
Blixem:
Dotter:
(Thanks to Dan Lawson dl1@sanger.ac.uk who inspired this idea and wrote the below and to Jean Thierry-Mieg mieg@ray.nlm.nih.gov who did the original coding for this new feature.)
ACEDB handles groups of objects in "Keysets" (you have seen the xace window termed 'Main KeySet' or the tace line '// 13 Active Objects'). We use Keysets all the time during an ACEDB session as transient lists of objects we are interested in, but rarely save them to the database for later retrieval/use.
Saved keysets can be retrieved using standard queries on the KeySet class. In xace a keyset can be accessed by double-clicking on the appropriate name to open a new keyset containing the appropriate objects. In tace the process is slightly more complex, requiring the user to use query/find to select the keyset object and then use the separate 'follow' command to de-reference the list of objects:
acedb> query find Keyset CDS_with_RNAi
// Found 1 objects
// 1 Active Objects
acedb> follow
// Follow without tag-name, enters active keySets
// Found 18068 objects
// 18068 Active Objects
acedb>
A major limitation of the tace behaviour is that you could not use the query language 'follow' keyword in the same way, i.e. the following would not work:
acedb> query find Keyset CDS_with_RNAi; follow
// ERROR - Nothing to the right of FOLLOW: find Keyset CDS_with_RNAi; follow
// Found 0 objects
// 0 Active Objects
acedb>
This is because the follow shown here is part of the query language and used
to navigate an object using tag names, i.e. "follow <tagname>".
A new "expand" keyword has been added to the
query language to allow this dereferencing of a keyset into
an object list as part of a more complex concatenated query.
You can now use stored keysets in the query language as follows:
acedb> query find Keyset CDS_with_RNAi; expand; where From_laboratory = HX;
// Found 8425 objects
// 8425 Active Objects
acedb> quit
This new feature will be used in the WormBase website (should be available later this month) to allow you to retrieve stored keysets, de-reference them and use them in queries within the single line query language search box on the website (http://www.wormbase.org/db/searches/query):
find Keyset CDS_with_RNAi; expand; where From_laboratory = HX;
This will return those members of the CDS_with_RNAi keyset which were predicted by the Sanger Institute (n.b. the keyset is stored in the database as part of the WormBase build process as a 'canned' query result).
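To make the three stages of that query concrete, here is a toy Python model of "find", "expand" and "where" acting on an active list. It is only a sketch of the semantics described above, not how acedb implements them: the in-memory `DB` dictionary, the object names `cdsA` to `cdsC`, the `RW` laboratory code and the three helper functions are invented for the example, and the sketch simply drops any non-keyset object passed to `expand`, which is an assumption.

```python
# Toy "database": class name -> object name -> tags.
# Only the keyset name and the HX code come from the newsletter text.
DB = {
    "Keyset": {
        "CDS_with_RNAi": {
            "members": [("CDS", "cdsA"), ("CDS", "cdsB"), ("CDS", "cdsC")],
        },
    },
    "CDS": {
        "cdsA": {"From_laboratory": "HX"},
        "cdsB": {"From_laboratory": "RW"},
        "cdsC": {"From_laboratory": "HX"},
    },
}


def find(cls, name):
    """'find Keyset X': the active list holds the keyset object itself."""
    return [(cls, name)] if name in DB.get(cls, {}) else []


def expand(active):
    """'expand': replace each keyset in the active list by its members."""
    members = []
    for cls, name in active:
        if cls == "Keyset":
            members.extend(DB[cls][name]["members"])
    return members


def where(active, tag, value):
    """'where Tag = value': keep objects whose tag has the given value."""
    return [(c, n) for c, n in active if DB[c][n].get(tag) == value]


active = find("Keyset", "CDS_with_RNAi")        # 1 object: the keyset
active = expand(active)                         # 3 objects: its members
active = where(active, "From_laboratory", "HX")
print(active)
```

The point of the model is that "find" alone leaves a single keyset object active, and it is "expand" that turns it into the list of member objects that later clauses can filter.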
Rob has made some further updates to the tree display search facility; below is his revised description.
(Thanks to Rob Clack rnc@sanger.ac.uk for this.)
Starting with release 4_9s, the old "Find Tag" or even older "Warp to Tag" function has been replaced by a new, more generic "Find" which locates any object you specify, whether a tag or not.
As with previous versions, you can use the right mouse button to drop down the menu, from which you can choose Find. However, the new version has various other ways of initiating the search.
Control/F and Control/B present you directly with the usual dialogue box for you to enter your target (i.e. skipping the menu), and pressing OK then searches forwards or backwards as you'd expect. The reverse search starts at the distal end of the tree and works backwards, of course.
Alternatively, if you can see the first of several identical things you want to find, you can click on that one, then use Next and Previous (see below) to move through the tree.
The search is case-insensitive and leading and trailing wildcards are supported. Thus B033* finds everything starting with B033, for instance.
For performance reasons, only the expanded portion of the tree is searched. Unfolding each node, searching it and folding it away was considered too costly to be worth implementing, and users we consulted said that in general they know the location of the objects they're seeking well enough to unfold the necessary nodes manually.
If the object can't be found, you get a warning message.
Once it has been located, you can get to the next and previous occurrences of your text using the Control/. and Control/, keys. I originally wanted to use Control/N and Control/P but the latter was already in use for printing, and I thought the appearance of the chevron keys was in keeping with the function, though I've not insisted you also press Shift to get there.
When you reach the end/beginning of the tree, you get a warning message, after which a further Next/Prev wraps the search back to the point at which it started.
Control/F and Control/B always give you back the Find dialogue box to allow you to change the object you're looking for, but note that it will start from the top of the tree rather than your current position if you do so.
Find also works in the UpdateTree display.
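The matching rules described above (case-insensitive, wildcards, only expanded nodes searched, Next/Previous wrapping round) can be sketched in a few lines of Python. This is an illustration, not the xace code: the function names and the sample node list are made up, Python's `fnmatch` accepts more wildcard forms than the leading and trailing `*` mentioned above, and the sketch wraps silently rather than showing a warning first.

```python
from fnmatch import fnmatchcase


def find_matches(visible_nodes, pattern):
    """Indices of visible (expanded) nodes matching pattern, ignoring case."""
    pat = pattern.lower()
    return [i for i, text in enumerate(visible_nodes)
            if fnmatchcase(text.lower(), pat)]


def next_match(matches, current, step=1):
    """Move to the next (step=1) or previous (step=-1) match, wrapping round."""
    if not matches:
        return None
    return matches[(matches.index(current) + step) % len(matches)]


# Hypothetical text of the currently expanded nodes, top to bottom.
nodes = ["B0334.1", "Clone", "b0335.2", "Remark"]
hits = find_matches(nodes, "B033*")
print(hits)                        # [0, 2]
print(next_match(hits, 2))         # 0  (Next from the last hit wraps to the first)
print(next_match(hits, 0, -1))     # 2  (Previous from the first hit wraps to the last)
```

Folded nodes never appear in `visible_nodes`, which mirrors the decision to search only the expanded portion of the tree.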
NEW HOTKEYS
In the TreeDisplay, you can shift directly to the UpdateTree using Control/U and then you can get back to the TreeDisplay using Control/T in UpdateTree.
Finally, this is old news but not everyone knows it. If you select a tag in TreeDisplay and go from there to the UpdateTree display, that tag will be centred on the screen for you.
Do you know how to edit LongText objects in xace? Following a recent bug report from a user, a script that LongText editing needs in order to work, and which was missing from our acedb distribution, has been restored to wscripts.
If you start xace and display an object containing LongText objects in a tree display window, you will see the LongText objects in a numbered list at the bottom of the window. To edit one of these objects, double click on the LongText object's name in this numbered list; this will display the LongText object in a special LongText display window. The LongText window gives you a number of options including "Edit"; if you select this option, an editor will be displayed containing the LongText text, which you can then change. On saving the text and exiting the editor, the new text will be reread into acedb as the new LongText object.
For all this to work you need to have a small script on your path which will start up the editor of your choice. Acedb provides a sample script in wscripts/acedb_editor:
#!/bin/ksh
#
# This script is called when you click the "Edit" button in the
# LongText display window (reached by double clicking on the
# name of any longtext displayed in the tree display).
# You should copy it to a directory on your PATH otherwise acedb
# will not be able to find it (N.B. make sure that it's executable).
#
# If you haven't got ksh on your system then just adapt this script.
#
# Change "xemacs" to the editor of your choice.
#
#
donename="$1.done"
# Quote the file name so that paths containing spaces still work.
xemacs -q "$1"
cp "$1" "$donename"
echo "***LongTextEnd***" >> "$donename"
exit 0
Blixem is one example of acedb code that relies on fixed-length buffers without checking whether they have overflowed, largely because standard C library calls such as scanf() do not allow such checking in any easy way.
This has been causing blixem to overwrite existing sequence records read from SEQBL files with more sequence when the sequence was too long to fit in Blixem's internal buffers.
The bug has now been fixed by using a dynamic buffer for SEQBL file reading.
Mark unwittingly introduced a bug into catText() a couple of weeks ago while fixing an existing bug in catBinary(). There are a couple of things to note here, I think:
In code where you register a callback function using the graphRegister() call for "PICK" (i.e. left button down), the callback function now takes an extra parameter which is the modifier key (Alt, Cntl etc) pressed at the time of the left button down. This allows you to handle key combinations such as "Cntl-left button", which gives a bit more flexibility to the way we handle keyboard interactions, especially given the current effort to provide more keyboard shortcuts.
During the introduction of the special case Selenocysteine translation code I spent a bit of time rationalising the code that does the translations to make it easier to maintain. There was code in both fmapfeatures.c and fmapsequence.c which essentially duplicated translation code in peptide.c. All this code is now in peptide.c. There is probably further rationalisation within peptide.c that can be done, but it's a start to have it all in one file.
Two new directories have been created in wdoc as repositories for developer documentation:
I've been working on an adulteration of some emacs code that will allow us to put text like the following in our source code and then display the design document referred to in the comment (in this case "ServerDesign.html") in our own choice of browser:
/* If you alter this bit of code first read ServerDesign.html as there are some
* non-obvious bits of coding. */
When this is finished I'll put it in cvs so we can all access it.
NOTE: I don't think these directories are for fantastically finished documents; they should simply be a repository of what we have. Please add to it when you can.
The default compilation of dotter has been changed to non-optimised. Compiling optimised by default makes debugging hard, and it is easy for any user to override this by changing the makefile as they wish.
(Thanks to Mark Sienkiewicz sienkiew@ncbi.nlm.nih.gov for this.)
Occasionally AceDB creates one-off huge database/blockNN.wrm files (reaching sizes of 2GB and more) for "no apparent reason". There have been various theories about this, but Mark has done the work to crack the problem. He has fixed the code in acedb 4_7 (Jean's working code) but the fix still needs importing into 4_9. Here is what he has to say:
Since then, I discovered the mechanism for Jean's problem of seeing the files go over 2 gig. When you allocate a block, it finds a free slot in the BAT (block allocation table). When there are no free slots, it appends 64 megabytes of data to the last data file and extends the BAT in memory. When you issue a "save" command, it writes the updated BAT into blocks in the data file.
BUT-- suppose tacembly crashes when you try to assemble genes from clones. It writes data to the database, which forces the file to be extended. BUT because it crashes before you save, the updated BAT is never written to the file.
Ok, it crashed, so you hack on it a little and run it again. It finds the most recent saved session, which includes the old BAT. It starts creating the same data again (remember the first run was never saved), and so again it overflows the old BAT and extends the file.
Now, it extends the file by appending 64 meg of data, not by writing 64 meg to the file beginning at the last block in the BAT. So if I had a 500 meg file that needs to extend by 500 meg, I have
500 meg before the first run
1000 meg after the first run crashes
1500 meg after the second run crashes
2000 meg after the third run crashes
2500 meg after the fourth run
Each time, the BAT still thinks there is only 500 meg in the file, so it believes it is ok to extend the file. Instead of just going 64 meg over the file size limit (which happens all the time), the file can grow arbitrarily large.
And, of course, when it goes over 2 gig, lseek() with signed 32 bit numbers starts failing, and the database crashes. When you run it again, it tries to make the file bigger...
b.t.w. I looked in your disknew.c, you do have this bug, but you probably never noticed it because you only see it if the database crashes between a big write and a save. Presumably, you could also see it if you parse a very large file and then quit without saving.
Nice one Mark !
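Mark's arithmetic can be reproduced with a small Python simulation. It is only a sketch of the mechanism he describes, not the code in disknew.c: the `BlockFile` class and its attribute names are hypothetical, sizes are tracked in whole megabytes, and each run is assumed to need the same 500 meg extension used in his example rather than a series of 64 meg steps.

```python
LIMIT_MB = 2048  # 2**31 bytes in megabytes: where signed 32-bit offsets run out


class BlockFile:
    """Tracks the real file size and the extent recorded in the saved BAT."""

    def __init__(self, size_mb):
        self.on_disk_mb = size_mb     # actual length of the data file
        self.saved_bat_mb = size_mb   # what the last saved BAT knows about

    def run(self, needed_mb, crash_before_save):
        # A new session starts from the BAT of the last saved session.
        bat_mb = self.saved_bat_mb
        # The BAT is full, so the file is extended by appending to its
        # real end, not at the last block the BAT knows about.
        self.on_disk_mb += needed_mb
        bat_mb += needed_mb
        # Only a successful save records the extended BAT in the file.
        if not crash_before_save:
            self.saved_bat_mb = bat_mb
        return self.on_disk_mb


db = BlockFile(500)
sizes = [db.run(500, crash_before_save=True) for _ in range(4)]
print(sizes)                   # [1000, 1500, 2000, 2500]
print(db.saved_bat_mb)         # 500: the saved BAT never learned of the growth
print(sizes[-1] > LIMIT_MB)    # True: the file has passed the 2 gig limit
```

Because the saved BAT stays at 500 meg across every crashed run, each new run sees the same "full" table and extends the file again, which is why the growth is unbounded.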
You can pick up the monthly builds from: