AQL - internals

Part 1 : Parsing

An AQL query is sumbitted as plain text that contains keywords, database indentifiers (e.g. names of tags and classes) and language specific tokens (e.g. ->, ::, []) which are translated into an internal representation of the query for further processing.

The lex/yacc system is used to generate program fragments that build the AQL parser. This strategy allowed us to build a comprehensive parser in a well structured and almost self-documenting manner, as the source-files for lex/yacc closely relate the documentation of the AQL syntax).
The drawback is the inablity of a lex/yacc generated parser to present meaningful error-messages to the user in the event of a syntax error. However, errors messages reported during the preprocessing stage are much more explanatory.

The lexer

The source file for lex is aqlparse.l. It defines all language keywords, operators and lexical rules for identifiers, numeral constants and date values.

The interface to the lex/yacc system isn't threadsafe and the C-module which is comprised of lex and yacc-generated code has a global static variable aql_L which holds all global status of the query that the various code-snippets access.

The API to lex/yacc is quite restrictive and its POSIX standard requires that the input has to be a FILE* pointer. Various loopholes in the API allowed us to read the lex input directly from a text-buffer containing the query-string. Note that those loopholes are different in AT∓T based lex implemetations and the GNU lex-clone flex. See the comments at the top of aqlparse.l for details.

Furthermore, the limited API only allows the programmer to keep track of the line-number of each token. This is very restrictive for AQL queries which in most cases constist only of one line of text. For better error-reporting the array tokPos[] is used to keep track of the position of a token in the query-string. Errors reported in the preprocessing stage can therefore point to the specific word in the querystring that was responsible for the error. (see aqlerror.c:aqlError())

The yacc parser

The source file for yacc is aqlparse.y. It defines the language syntax in BNF grammar and contains the code sippets that build the parse-tree.

The parse-tree is made up from nodes of various types wich are enumerated by the enum-type AqlNodeType in in the header-file aql_.h. The is no direct relationship between the enum-identifiers used in aqlparse.l and the node-types. The token-identifiers used by the lexer generally correspond directly to the keyword that they represent, whereas the identifiers of the node-types decribe the language construct.

The initial parse-tree

The utility taql is a test-bed for aql queries and provides a few hooks into the AQL internals than tace which allows nothing but the input of a query-text. The most useful function is the ability to turn on the debugging flags in the AQL kernel.

% taql -d 2 $ACEDB

The debug level 2 will output a textual parse-tree, which is useful to study how the query is processed at each stage. Let's study an example of a parse-tree at the parsing stage, i.e. the tree directly constructed by aqlparse.y

This example query will be converted into the following parse-tree :

select a, a->paper from a in class author where a.name like "m*"

TABLE_SFW          1405c0460
 left   FROM_LOC name "a"               1404da3a0
  left   CLASS name "author"            1404da440
  right  BOOL_COMPARISON op like                1405c03c0
   left   LOC_VAR name "a"              1404da4e0
   next   LOC_METHOD name "name"                1404da580
   right  TEXT m*               1405c0320
 right  SELECT_FIELD            1404da120
  left   LOC_VAR name "a"               1404c7380
 next   SELECT_FIELD            1404da300
  left   LOC_VAR name "a"               1404da1c0
  next   LOC_FOLLOW_TAG_NAME name "paper"               1404da260

The nodes in the tree are linked by pointers left, right and next and the tree is traversed recursivly in that order. The memory addresses in italics may be ignored for now, they can be used when nodes-structures are inspected in the debugger. The words in bold are the names of each node as declared by the AqlNodeType enumerator in aql_.h

The graphical representation of the debug-output would look like this :

The graphical tree has 4 levels, that correspond to the 4 indentation levels of the debug output. The types of nodes that are allowed for the left, right and next pointers varies for each type and implies a different meaning depending on each type of node.

At this stage the parse tree is still very unorganised and needs to be pre-processed before it can be executed to produce the query-results.

Part 2 : Preprocessing

The parse-tree is quite dramatically altered at this stage. The goal is to turn the parsetree into a canonical form which can be executed and is guaranteed to be syntactically and semantically correct. The pre-processing is divided into 2 main parts, which are performed by the functions aqlcheck.c:aqlCheck1() and aqlrun.c:aqlCheck2(). During check1 the parse-tree is converted into a canonical form ready for query execution and check2 accesses the database to verify the validity of identifiers such as class-names and tag-names.

An important concept to the internals of the AQL engine is the locator. A locator is a data-carrier for certain nodes, which provides valuable linkage information across the tree. In the debug-ouput below, one can see, that the LOC_VAR "a" is mentioned multiple times in the tree, but careful examination of the locator address reveals that all these node carry the same locator (loc=1405d2e20). Therefore the state of the variable "a" will be the same whichever part of the parse is looking at it, e.g. a.name and a->paper will eventually only have to eveluate "a" once, and then the result of the name-method and the dereferncing to the tag "paper" will be based on the same value for "a".
The assignment of locators to certain nodes is a vital part of the work done during pre-processing.

Check1> :

The debug output of the canonical parse-tree after semantic pre-processing :

TABLE_SFW                1405c0460
 left   FROM_LOC name "a"               1404da3a0 (loc=1405d2e20)
  left   CLASS name "author"            1404da440
 next   FROM_LOC name "_1"              1405d1660 (loc=1405d2ec0)
  left   LOC_VAR name "a"               1405d1700 (loc=1405d2e20)
  next   LOC_METHOD name "name"         1404da580
 next   FROM_LOC name "_2"              1405d17a0 (loc=1405d2f60)
  left   LOC_VAR name "a"               1405d1840 (loc=1405d2e20)
  next   LOC_FOLLOW_TAG_NAME name "paper"               1404da260
 next   FROM_LOC name "_3"              1405d18e0 (loc=1405d30a0)
  left   LOC_VAR name "_2"              1405d1980 (loc=1405d2f60)
  next   LOC_LOCAL_POS number 1         1405d1520
  right  BOOL_COMPARISON op like                1405c03c0
   left   LOC_VAR name "_1"             1404da4e0 (loc=1405d2ec0)
   right  TEXT m*               1405c0320
 right  LOC_VAR name "a"                1404c7380 (loc=1405d2e20)
 next   LOC_VAR name "_3"               1404da1c0 (loc=1405d30a0)

The graphical representation of the debug-output would look like this :


The parse-tree has gained a lot of new nodes and some nodes have even been lost. The guts of the query have been converted to a series of FROM_LOC nodes which are reminiscent of the from keyword. They are the declarations of internal iterators. The user has already declared one iterator ("a") in the from-clause of the query. This iterator is class-based and will therefore loop over all objects in the class "Author".
The other iterators (with names like "_1") have been added by the check1 pre-processor.

The reader is reminded of the convention on positions in an Acedb-model. From every tag you can move a certain number of steps to the right to retrieve a specific data-field. In AQL the default tag-position is 1, i.e. the expression a->paper really means a->paper[1] and will therefore return the data-element which is one position to the right of the tag "paper" in an Author-object. The tag-position 0 refers to the tag itself. The pre-processor will create some extra tag-positiong nodes where the user has not specified one.

The SELECT_FIELD nodes have disappeared, after some processing they are no longer required, because the node to the right of a TABLE_SFW node automatically identifies itself as a select-field. The 2 select-fields are preserved as LOC_VAR nodes. This query will report 2 columns with the values of the locators "a" and "_3" at each iteration cycle. Where did the "_3" variable come from ?

The variable with names that start with an underscore "_" are internal variables. They could have been declared by the user, but usually the query is presented in a somewhat abbreviated form. Internally, the query generates 4 results :

  1. the iterator "a" itself ( "a" )
  2. the method "a.name" ( "_1" )
  3. the tag itself "a->paper[0]" ( "_2" )
  4. the datafield to its right "[1]" ( "_3" )
One can now see from the select-fields what the 2 output columns will contain. The internal iterators "_1" and "_2" can be thought of as hidden columns. They aren't requested by the select-cause and will therefore not be reported but nevertheless must be computed.

So, all the internal iterators become a linked list of FROM_LOC nodes attached to the left of the TABLE_SFW node, which is the root-node. If a variable is declared to be based on a table-variable (e.g. select a:1 from a in @results) rather than a class, those iterators are FROM_TABLE nodes.

The where-clause was parsed straight into the appropriate BOOL_... node by the yacc-parser without the need of intermediate WHERE_... nodes (like those needed for the SELECT_FIELDs). The where-clause is represented by a node attached to the right of a FROM_LOC. This right node represents a condition on the from-clause. In the example the FROM_LOC called "_3" is only worth reporting if the expression to its right evaluates TRUE.

The operands of the where-clause expression can only refer to previously declared LOC_VARs. That's why the previous declaration of FROM_LOC "_1" is required. The BOOL_COMPARISON node is fairly straight forward - it's operator is the like function which compares the left node's value to the right node's value. If TRUE, the locator "_3" is valid and the LOC_VAR node "_3" in the select-clause will therefor report the result.

The parse-tree now represents the query in a canonical form which contains all the internal declarations needed for the query execution.

Check2 : The parse-trees is checked against the database. It also determines how the internals locators will be executed. This is the debug output of the parse-tree after aqlCheck2() :

TABLE_SFW                1405c0460
 left   FROM_LOC name "a"               1404da3a0 (loc=1405d2e20, IS_CLASS)
  left   CLASS name "author"            1404da440
 next   FROM_LOC name "_1"              1405d1660 (loc=1405d2ec0, IS_METHOD)
  left   LOC_VAR name "a"               1405d1700 (loc=1405d2e20, IS_CLASS)
  next   LOC_METHOD name "name"         1404da580
 next   FROM_LOC name "_2"              1405d17a0 (loc=1405d2f60, IS_FOLLOW_TAG)
  left   LOC_VAR name "a"               1405d1840 (loc=1405d2e20, IS_CLASS)
  next   LOC_FOLLOW_TAG name "paper"            1404da260
 next   FROM_LOC name "_3"              1405d18e0 (loc=1405d30a0, IS_LOCAL_POS)
  left   LOC_VAR name "_2"              1405d1980 (loc=1405d2f60, IS_FOLLOW_TAG)
  next   LOC_LOCAL_POS number 1         1405d1520
  right  BOOL_COMPARISON op like                1405c03c0
   left   LOC_VAR name "_1"             1404da4e0 (loc=1405d2ec0, IS_METHOD)
   right  TEXT m*               1405c0320
 right  LOC_VAR name "a"                1404c7380 (loc=1405d2e20, IS_CLASS)
 next   LOC_VAR name "_3"               1404da1c0 (loc=1405d30a0, IS_LOCAL_POS)

The parse-tree has largely unchanged, but the locators have been assigned types. Those types will dictate how a node is treated during query execution.

Part 3 : Query execution

The parse-tree can now be passed to the function aqlrun.c:aqlEval() which executes the query starting from the root.

Fuzzy logic : The query is executed twice, in fuzzy and non-fuzzy mode. The first-pass in fuzzy mode will never open any objects and therefore execute very quickly. Any result in fuzzy mode can have one of 3 states : NULL, FUZZY or KNOWN. Some queries can be satisfied without ever having to open any objects and will return a KNOWN-value after the fuzzy-pass (which saves time).
Most queries however, will have to open an object at some time to get a result. During the fuzzy-run the result will have the FUZZY state. Any expression using a FUZZY value will be tainted as FUZZY too, so if the result of the root-node is still fuzzy after the first-pass a re-run is required.
The second-pass runs the query in non-fuzzy (i.e. exact) mode. By opening objects (the time-consuming part) all FUZZY result will be evaluated as either NULL or KNOWN. The results of the non-fuzzy re-run can never be fuzzy.

The execution loop : The guts of the query-execution is the linked list of FROM_LOC nodes which is attached to the left of the TABLE_SFW node.
The query-execution will loop over those nodes in turn and at every FROM_LOC it will exhaust all possibilities for that locator. The locator "a" will re-loop over "_1" to "_3" for every KEY found in the given class. The locator "_3" will be repeatedly processed for all its results, i.e. all data-fields that are found one position to the right of the "paper"-tag.

The type of each locator specifies the way it is being processed. There are single-valued and multi-valued locators. In our example there are the following types :

  • IS_CLASS - multi-valued - loops over all KEYS in the given class
  • IS_METHOD - single-valued - performs given method
  • IS_FOLLOW_TAG - single-valued - find the given tag in the object
  • IS_LOCAL_POS - multi-valued - position to the right, then loop down
For details on the other types, see the function aqlrun.c:loopVariables().

The other important part of the query-execution loop is the processing of each locator at every stage in the loop. Every locator is subjected to this (aqlrun.c:processThisLoc()) at the end of a loop-cycle. To find out whether the locator is valid we need to check for its where-clause and evaluate it if necessary. The boolean value of the expression-node which is attached to the right of a FROM_LOC node is used for this decision. If the where-clause expression is TRUE we can move on to the next FROM_LOC node to advance the cycle.
When the cycle gets to its end (the FROM_LOC is the end in the chain or the where-clause failed and we terminate the cycle) we have everything we need for this particualr state in the recursion cycle. It is now time to look at the select-field nodes to the right of the top-level table-expression (the TABLE_SFW node) and copy the results into the results table before continuing to loop. The select-fields are expressions, which are evaluated and their results are added to the end of the table. A LOC_VAR expression is easy to evaluate - just copy the value of its locator into the table.

The loop can now continue and the recursion will bubble up to the last multi-valued node that wasn't yet exhausted. Thereby every possible combination is visited and the results are gathered row-by-row in the results-table which is attached to the top-level TABLE_SFW node. When the recursion has finished this table-value is returned as the final result.


written by Fred Wobus, 06/Dec/1999
acedb@sanger.ac.uk