I have used package 'bit64' as a testbed to explore a couple of approaches for implementing R's univariate algorithmic functionality efficiently. I have focused on single-threaded efficiency for two reasons: 1) Amdahl's law dictates that the more we parallelize, the more we depend on serial efficiency. 2) When working with truly big data, it is not only absolute speed but also energy consumption that we care about.

Under the hood, package 'bit64' has multiple implementations of the same functionality, and the high-level functions contain (as yet simple) heuristic optimizers that choose among the available low-level functions. For example, 'match' can choose between eight functions based on hashing or sorting/ordering. Function 'match' (and '%in%') has been accelerated by complementing lookup of 'x' in a hashed 'table' with reverse lookup of 'table' in a hashed 'x'. If 'x' is small and 'table' is big, reverse lookup avoids the cost of building a huge hashmap.

As suggested by Simon Urbanek's package 'fastmatch', if 'match' is called multiple times with the same 'table', performance can be improved by re-using the hashmap implicitly built by 'match'. Beyond that, I have realized a couple of improvements:

1) Building the hashmap has been singled out into a separate function 'hashmap' that explicitly returns an environment of class c("cache_integer64", "cache", "environment") containing the hashmap and some auxiliary data.

2) Instead of implicitly caching the hashmap as a side-effect of calling 'fastmatch', there are explicit functions for caching, for example 'hashcache' for attaching a cache with a hashmap and 'remcache' for removing any cached data (see the sketch after this list).

3) If, after hashing, the 'hashcache' function discovers that the number of unique values is much smaller than the total number of values, it hashes again into a much smaller hashmap: this typically saves a lot of RAM and accelerates usage of the hashmap because it reduces random access.

4) The cache layer has a mechanism for detecting outdated caches. This is all the more important for a cached hashmap, since R's typical hashmap contains only index pointers to the data, not the data itself (unlike standard hashtables). As a result, an outdated cache might lead to a crash if the data has changed since creation of the cached hashmap. The detection mechanism comes for free, since R does copy-on-write and each change of a vector leads to memory reallocation: on each cache access we check for a modified vector address and remove the cache with a warning if a change is detected. However, this method is of course not failsafe in the case of multiple changes. Therefore, until cache checking and removal is done in Base R, users of caching should carefully remove caches before modifying data. Users must also carefully remove caches before using functions that do in-place modifications such as 'ramsort', 'ramorder' and 'ramsortorder'. Users should also note that R's copy-on-modify mechanism does more copying than one would expect: just reading from variable length arguments with the recommended 'list(...)' construct always copies the arguments and invalidates caches. For a workaround see the implementation of 'table.integer64'.

5) Beyond 'match', the package leverages the speed gains of hashing or cached hashing for a couple of other high-level functions: '%in%', 'duplicated', 'unique', 'unipos' and 'table'. However, it turned out that only 'match', '%in%' and 'duplicated' benefit from a cached hashmap: for 'unique', 'unipos' and 'table' the cost of traversing an existing hashmap is as high as creating the hashmap from scratch. This leads to the undesirable effect that we need two implementations for each of these methods: one that simultaneously builds and uses the hashmap, and another that uses an existing hashmap (using the simultaneous method while a hashmap is cached would duplicate RAM consumption).

6) Beyond leveraging hashing, all these high-level functions also have two low-level implementations that take advantage of (cached) ordering and (cached) sortordering instead (see order below).

7) Additional functions are implemented that benefit only from (cached) ordering and (cached) sortordering: 'sort', 'order', 'tiepos', 'keypos', 'rank', 'quantile' and dependents thereof ('median', 'summary', 'as.factor', 'as.ordered', 'table').
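For illustration, here is a minimal sketch of the explicit cache handling around 'match' (vector sizes and values are merely illustrative, not a benchmark):

    library(bit64)

    tbl <- as.integer64(sample(1e8, 1e6, replace=TRUE))  # big lookup table
    x   <- as.integer64(sample(1e8, 1e4, replace=TRUE))  # small set of keys

    p1 <- match(x, tbl)   # builds a temporary hashmap (of 'x' or 'tbl',
                          # whichever the optimizer picks) and discards it
    hashcache(tbl)        # explicitly attach a hashmap cache to 'tbl'
    p2 <- match(x, tbl)   # re-uses the cached hashmap
    stopifnot(identical(p1, p2))
    x %in% tbl            # also served from the cache

    remcache(tbl)         # remove the cache before 'tbl' is modified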
Method 'sort' is a cache-aware wrapper around 'ramsort', which, depending on context, chooses from multiple sorting algorithms (or from the cache): 'shellsort' (R's traditional in-place sorting algorithm), 'quicksort' (faster in-place), 'mergesort' (fast and stable) and 'radixsort' (stable with linear scalability, for large datasets). The quicksort implemented here is, in this context, faster than the famous one of Bentley and McIlroy. It uses the median of three random pivots and is, like introsort, protected against O(n^2) runtime (if a recursion limit is reached, it for now falls back to shellsort instead of heapsort).

Function 'order.integer64' with option 'optimize = "memory"' calls 'ramorder', which chooses from a similar set of low-level algorithms. 'ramorder' - like in package 'ff' - is faster than ordering in Base R, but like 'order' in Base R it still does the job by sorting index pointers to the data, which creates heavy random access to the data. The novel 'ramsortorder' method realizes ordering close to the speed of sorting by sorting index and data simultaneously, thereby avoiding heavy random access. Therefore the option 'optimize = "time"' is the default in 'order.integer64' and calls 'ramsortorder' (see the sketches below).

Function 'rank.integer64' implements only 'ties.method = "average"' and 'na.last = "keep"' (the only sensible default, see e.g. 'cor'). Function 'prank.integer64' projects the values [min..max] via ranks [1..n] to [0..1]. 'qtile.integer64' is the inverse function of 'prank.integer64' and projects [0..1] to [min..max]. 'quantile.integer64' with 'type=0' and 'median.integer64' are convenience wrappers around 'qtile'. 'qtile' behaves very similarly to 'quantile.default' with 'type=1' in that it only returns existing values; it is mostly symmetric, but it uses 'round' rather than 'floor'. Note that this implies that 'median.integer64' does not interpolate for an even number of values (interpolation would create values that could not be represented as 64-bit integers).

Function 'table.integer64' leverages hashing or sorting for counting the frequencies of all unique values. This is about a factor of 3 slower than 'tabulate', but when called with 'return="list"' it is an order of magnitude faster than 'table' (because 'table' wastes a lot of performance on large-scale raw data manipulation before calling 'tabulate', and on attaching the unique values as 'names', which puts a heavy load on the global string cache).
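A minimal sketch of the interplay of 'sort', 'order' and the sortorder cache ('order' is made generic by 'bit64' and dispatches to 'order.integer64'; sizes are again illustrative):

    library(bit64)

    x <- as.integer64(sample(1e6))    # a permutation, hence no ties

    # optimize="time" (the default) orders via 'ramsortorder',
    # optimize="memory" orders via 'ramorder' (~50% less RAM, but slower)
    o.fast <- order(x, optimize = "time")
    o.lean <- order(x, optimize = "memory")
    stopifnot(identical(o.fast, o.lean))

    sortordercache(x)                 # cache sorted data plus ordering
    s <- sort(x)                      # now served from the cache
    o <- order(x)                     # likewise
    remcache(x)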
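Similarly, a small sketch of the rank-based functions:

    library(bit64)

    x <- as.integer64(sample(1000, 100, replace = TRUE))

    p <- prank(x)              # project values via ranks [1..n] to [0..1]
    q <- qtile(x, probs = c(0, 0.25, 0.5, 0.75, 1))  # inverse of 'prank'
    quantile(x, type = 0)      # convenience wrapper around 'qtile'
    median(x)                  # returns an existing value: no interpolation,
                               # although 100 is an even number of values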
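Finally, a sketch of 'table.integer64'. I am assuming here three 'return' options: "list" avoids the expensive 'names' attribute, "data.frame" returns only the existing combinations (the sparse format mentioned below), and the default densifies like standard 'table':

    library(bit64)

    x <- as.integer64(sample(5, 20, replace = TRUE))
    y <- as.integer64(sample(3, 20, replace = TRUE))

    table.integer64(x, return = "list")           # counts without 'names'
    table.integer64(x, y, return = "data.frame")  # sparse: existing combinations
    table.integer64(x, y)                         # dense array like base 'table'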
When dealing with combinations of input vectors, 'table.integer64' can handle up to 2^63 hypothetical combinations and can return the existing combinations in a sparse format, whereas standard 'table' theoretically bails out at 2^31 combinations (practically earlier, due to RAM limitations) and insists on returning a full-blown dense array.

I compared the speed gains of hashing+usage versus sortordering+usage over a couple of univariate algorithmic operations: hashing and sortordering are competitive, with hashing rather winning for smaller and sortordering rather winning for larger vectors (due to the better cache-obliviousness of sorting). The third option - ordering - is much slower, though competitive with Base R, and its 50% RAM saving makes it an interesting option, especially when working with datasets close to the RAM limits. Though operations based on hashing can be faster than those based on sortordering, it is worth noting that if the sortorder is cached, going with the sortorder-based operation is in most cases faster than building the hashmap and using it. Thus sortordering seems a better RAM investment than hashing. It has the following advantages:

- sortordering supports more functionality than hashing
- sortordering gives better modularity (unlike with hashing, we can cleanly separate *creating* and *using* the sortordering, because sorting permanently improves cache-locality)
- without the computational cost of keeping the original order ('keep.order=TRUE' in 'unique' and 'table'), sortordering gives sorted results while hashing gives random result order; if there are many unique values, fixing the random order by sorting afterwards kills any performance benefit of hashing, compare for example the sequence {y <- unique(x); ind <- sort.list(y)} in 'factor'
- sorting generalizes better to very large data on disk than hashing
- it is easier to parallelize sorting lock-free than hashing
- creating the ordering quickly via sortordering and then caching only the ordering (without the sorted data) is an interesting option to save RAM without too much speed loss
- with ordering instead of sortordering there is an option to work with large borderline-sized datasets in RAM

These advantages of sorting over hashing are good news for my novel energy-efficient greeNsort® algorithms. The long-term roadmap for packages 'bit64' and 'ff' is:

- demonstrate the power of greeNsort® by accelerating integer64 sorting by yet another factor of 2
- parallelization of important functions in bit64
- unifying the sort capabilities in ff with those in bit64 (logical, factor, integer, integer64, double)
- generalizing the fast data management to all numeric data types (integer, integer64, double)
- removing the 2^31-1 address limit in ff (using integer64 rather than double)
- providing ff with proper disk sorting (reducing n*log(n) passes to 2 passes over the memory-mapped disk)

© 2010-2012 Jens Oehlschlägel