An experimental evaluation and analysis of database cracking
Abstract
Database cracking has been an area of active research in recent years. The core idea of database cracking is to create indexes adaptively and incrementally as a side product of query processing. Several works have proposed different cracking techniques for different aspects including updates, tuple reconstruction, convergence, concurrency control, and robustness. Our 2014 VLDB paper “The Uncracked Pieces in Database Cracking” (PVLDB 7:97–108, 2013/VLDB 2014) was the first comparative study of these different methods by an independent group. In this article, we extend our published experimental study on database cracking and bring it to an up-to-date state. Our goal is to critically review several aspects, identify the potential, and propose promising directions in database cracking. With this study, we hope to expand the scope of database cracking and possibly leverage cracking in database engines other than MonetDB. We repeat several prior database cracking works including the core cracking algorithms as well as three other works on convergence (hybrid cracking), tuple reconstruction (sideways cracking), and robustness (stochastic cracking), respectively. In addition to our conference paper, we now also look at a recently published study about CPU efficiency (predication cracking). We evaluate these works and show possible directions to do even better. As a further extension, we evaluate the whole class of parallel cracking algorithms that were proposed in three recent works. Altogether, in this work we revisit 8 papers on database cracking and evaluate in total 18 cracking methods, 6 sorting algorithms, and 3 full index structures. Additionally, we test cracking under a variety of experimental settings, including high selectivity queries (low selectivity means that many entries qualify; consequently, high selectivity means that only few entries qualify), low selectivity queries, varying selectivity, and multiple query access patterns. Finally, we compare cracking against different sorting algorithms as well as against different main memory optimized indexes, including the recently proposed adaptive radix tree (ART). Our results show that: (1) the previously proposed cracking algorithms are repeatable, (2) there is still enough room to significantly improve the previously proposed cracking algorithms, (3) parallelizing cracking algorithms efficiently is a hard task, (4) cracking depends heavily on query selectivity, (5) cracking needs to catch up with modern indexing trends, and (6) different indexing algorithms have different indexing signatures.
Keywords
Adaptive indexing · Database cracking · Sorting · Multi-threaded algorithms
1 Introduction
1.1 Background
Database cracking example
1.2 Our focus
Database Cracking has been an area of active research in recent years, led by researchers from CWI Amsterdam. This research group has proposed several different indexing techniques to address different dimensions of database cracking, including updates [14], tuple reconstruction [15], convergence [16], concurrency control [8, 9], and robustness [11]. In this paper, we critically review database cracking in several aspects. We repeat the core cracking algorithms, i.e. crack-in-two and crack-in-three [17], as well as three advanced cracking algorithms [11, 15, 16]. We identify the weak spots in these algorithms and discuss extensions to fix them. Additionally, we inspect a recently published work [23], which identifies CPU efficiency problems in the standard cracking algorithm and proposes alternatives. Furthermore, we investigate the current state-of-the-art in parallel cracking algorithms [2, 8, 9, 23] and compare them against each other. Finally, we also extend the experimental parameters previously used in database cracking, e.g. by varying the query selectivities and by comparing against more recent, main memory optimized indexing techniques, including ART [20].
- 1.
Revisiting cracking We revisit the core cracking algorithms, i.e. crack-in-two and crack-in-three [17], and compare them for different positions of the pivot elements. We do a cost breakdown analysis of the cracking algorithm into index lookup, data shuffle, index update, and data access costs. Further, we identify four major concerns, namely CPU efficiency, convergence, tuple reconstruction, and robustness. In addition, we evaluate advanced cracking algorithms, namely predication cracking [23], hybrid cracking [16], sideways cracking [15], and stochastic cracking [11], respectively, which were proposed to address these concerns. Additionally, in order to summarize the differences and similarities between different cracking algorithms, we classify the cracking algorithms based on the strategy to pick the pivot, the creation time, and the number of partitions (Sect. 2).
- 2.
Extending cracking algorithms In order to better understand the cracking behaviour, we modify three advanced cracking algorithms, namely hybrid cracking [16], sideways cracking [15], and stochastic cracking [11]. We show that buffering the swap elements in a heap before actually swapping them (buffered swapping) can lead to better convergence than hybrid cracking. Next, we show that covering the projection attributes with the cracker column (covered cracking) scales better than sideways cracking in the number of projected attributes. Finally, we show that creating more balanced partitions upfront (coarse-granular indexing) achieves better robustness in query performance than stochastic cracking. We also map these extensions to our cracking classification (Sect. 3).
- 3.
Extending cracking experiments As a next step, we extend the cracking experiments in order to test cracking under different settings. First, we compare database cracking against full indexing using different sorting algorithms and index structures. In previous works on database cracking, the data indexed by the traditional methods used for comparison is ordered with quick sort. Further, the cracker index is realized by an AVL-tree [1] to store the index keys. In this paper, we do a reality check with recent developments in sorting and indexing for main memory systems. We show that full index creation with radix sort is twice as fast as with quick sort. We also show that ART [20] is up to 3.6 times faster than the AVL-tree in terms of lookup time. We also vary the query selectivity from very high selectivity to medium selectivity and compare the effects. We draw two key observations: (1) the choice of the index structure has an impact only for very high selectivities, i.e. higher than \(10^{-6}\) (one in a million); otherwise, the data access costs dominate the index lookup costs; and (2) cracking creates more balanced partitions and hence converges faster for medium selectivities, i.e. around \(10\,\%\). We also look at the effect of stopping the cracking process at a certain partition size. Furthermore, we apply a sequential and a skewed query access pattern and analyse how the different adaptive indexing methods cope with them. Our results show that sequential workloads are the weak spots of query-driven methods while skewed patterns increase the overall variance (Sect. 4).
- 4.
Parallelizing cracking algorithms As exploiting modern hardware implies using the multi-threading capabilities of the system, we investigate in this section how the cracking algorithms can be parallelized. To do so, we first re-evaluate a lock-based parallel standard cracking algorithm [8, 9] that serializes the crack-in-two operation at the granularity of partitions of data for inter-query parallelism. Additionally, we add the algorithms we proposed in our study on parallel adaptive indexing algorithms [2] and put the methods to the test under a new set-up. We include our parallel coarse-granular index that builds upon parallel standard cracking. We compare these methods to our intra-query parallel versions of standard cracking and coarse-granular index [2] that realize concurrency by dividing the column into chunks. We compare the parallel cracking algorithms with our two competitive parallel radix sort implementations [2] to evaluate the relation between cracking and sorting in a multi-threaded environment. Furthermore, we propose and evaluate two realizations of parallel sideways cracking. Last but not least, we evaluate all parallel algorithms under skewed queries, skewed input, and clustered input (Sect. 5).
- 5.
Conclusion Finally, we conclude by putting together the key lessons learned. Additionally, we also introduce signatures to characterize the indexing behaviour of different indexing methods and to understand as well as differentiate them visually (Sect. 6).
Experimental set-up We use a common experimental set-up throughout the paper. We try to keep our set-up as close as possible to the earlier cracking works. Similar to previous cracking works, we use an integer array with \(10^8\) uniformly distributed values with a key range of [0, 100,000]. Unless mentioned otherwise, we run 1000 random queries, each with selectivity \(1\,\%\). The queries are of the form: \(\mathtt{SELECT\,A\,FROM\,R\,WHERE\,A>=low\,AND\,A<high }\). We repeat the entire query sequence three times and take the average runtime of each query in the sequence. We consider two baselines: (1) scan which reads the entire column and post-filters the qualifying tuples, and (2) full index which fully sorts the data using quick sort and performs binary search for query processing. If not stated otherwise, all indexes are unclustered and uncovered. We implemented all algorithms in a stand-alone program written in C/C++ and compiled them with G++ version 4.7 using optimization level 3. Our test bed consists of a single machine with two Intel Xeon X5690 processors running at a clock speed of 3.47 GHz with support for Intel Turbo Mode. Each CPU has 6 cores and supports 12 threads via Intel Hyper Threading. The L1 and L2 cache sizes are 64 KB and 256 KB, respectively, for each core. The shared L3 cache has a size of 12 MB. Our machine has 200 GB of main memory and runs 64-bit Linux with kernel 3.1.
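For concreteness, the following is a minimal sketch of how such a workload can be generated; the seed, the random number generator, and the exact way of drawing the 1 % ranges are our own assumptions and not taken from the original set-up.

```cpp
#include <random>
#include <vector>

struct Query { int low, high; };

int main() {
  std::mt19937 gen(42);                                   // fixed seed (assumption)
  const int maxKey = 100000;
  std::uniform_int_distribution<int> keyDist(0, maxKey);

  std::vector<int> column(100 * 1000 * 1000);             // 10^8 uniformly distributed keys
  for (int& v : column) v = keyDist(gen);

  const int range = maxKey / 100;                         // ~1% selectivity on uniform data
  std::uniform_int_distribution<int> lowDist(0, maxKey - range);
  std::vector<Query> queries(1000);                       // 1000 random range queries
  for (Query& q : queries) { q.low = lowDist(gen); q.high = q.low + range; }
  // ... feed `column` and `queries` into the cracking / full-index methods under test
}
```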
2 Revisiting cracking
Let us start by revisiting the original cracking algorithm [17]. Our goal in this section is to first compare crack-in-two with crack-in-three, then to repeat the standard cracking algorithm under similar settings as in previous works, then to break down the costs of cracking into individual components, and finally to identify the major concerns in the original cracking algorithm.
2.1 Crack-in-two versus crack-in-three
Crack-in-two: Partition the index column into two pieces using one end of a range query as the boundary.
Crack-in-three: Partition the index column into three pieces using the two ends of a range query as the two boundaries.
Revisiting standard cracking. a Comparing single query indexing time. b Reproducing cracking behaviour. c Cost breakdown
2.2 Standard cracking algorithm
Standard cracking: Incrementally and adaptively sort the index column using crack-in-three when both ends of a range query fall in the same partition and crack-in-two otherwise.
We implemented the standard cracking algorithm which uses crack-in-three wherever two split lines lie in the same partition, and tested it under the same settings as in previous works. As in the original papers, we use an AVL-tree as cracker index to be able to compare the results. Figure 2b shows the results. We can see that standard cracking starts with similar performance as full scan and gradually approaches the performance of full index. Moreover, the first query takes just 0.3 s compared to 0.24 s of full scan,1 even though standard cracking invests some indexing effort. In contrast, full index takes 10.2 s to fully sort the data before it can process the first query. This shows that standard cracking is lightweight and it puts little penalty on the first query. Overall, we are able to reproduce the cracking behaviour of previous works.
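To make the physical reorganization concrete, the following is a minimal sketch of the branching crack-in-two step on a single integer column; rowIDs and the maintenance of the AVL-tree cracker index are omitted, and the function signature is our own simplification.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Reorganize data[lo..hi] (inclusive) so that all values < pivot precede all values >= pivot.
// Returns the first position of the upper piece; standard cracking stores (pivot -> position)
// in the cracker index. Crack-in-three applies the same idea with two pivots in one pass.
std::size_t crack_in_two(std::vector<int>& data, std::size_t lo, std::size_t hi, int pivot) {
  long L = static_cast<long>(lo), R = static_cast<long>(hi);
  while (L <= R) {
    if (data[L] < pivot)        ++L;                    // already in the lower piece
    else if (data[R] >= pivot)  --R;                    // already in the upper piece
    else { std::swap(data[L], data[R]); ++L; --R; }     // misplaced pair: swap across the split line
  }
  return static_cast<std::size_t>(L);
}
```

Per incoming query, standard cracking calls this once or twice (or crack-in-three when both query bounds fall into the same piece), and only on the pieces that contain the query bounds. Note the two data-dependent branches in the loop; their misprediction cost is exactly what Sect. 2.4 and predication cracking address.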
2.3 Cost breakdown
Key concerns in standard cracking. a CPU efficiency, b cracking convergence, c scaling projected attributes, d cracking variance
2.4 Key concerns in standard cracking
- 1.
CPU efficiency How an algorithm is mapped to the underlying hardware is crucial in memory resident data processing. Figure 3a shows the branch misprediction2 as the weak spot of the crack-in-two algorithm with respect to the relative position of the split line, making this method clearly CPU bound. At a worst-case position of the split line dividing the partition in the middle, more than \(50\,\%\) of the branches are predicted incorrectly.
- 2.
Cracking convergence Convergence is a key concern and major criticism for database cracking. Figure 3b shows the number of queries after which the query response time of standard cracking is within a given percentage of full index. The figure also shows a Bézier-smoothed curve of the data points. From the figure, we can see that after 1000 queries, on average, the query response time of standard cracking is still \(40\,\%\) higher than that of full index.
- 3.
Scaling projected attributes By default, database cracking leads to an unclustered index, i.e. extra lookups are needed to fetch the projected attributes. Figure 3c shows the query response time with tuple reconstruction, when varying the number of projected attributes from 1 to 10. For ease of presentation, we show only the Bézier-smoothed curves. We can see that standard cracking does not scale well with the number of attributes. In fact, after 1000 queries, querying tuples with 10 attributes is almost two orders of magnitude slower than querying tuples with a single attribute.
- 4.
Cracking variance Standard cracking partitions the index column based on the query ranges of the selection predicate. As a result, skewed query range predicates can lead to skewed partitions and thus unpredictable query performance. Figure 3d shows the variance of standard cracking query response times using the exponentially weighted moving average (EWMA). The variance is calculated as described in [6] (a common incremental formulation is sketched after this list); the degree of weighting decrease is \(\alpha =0.1\). We can see that unlike full index (see Fig. 2b), cracking does not exhibit stable query performance. Furthermore, we also see that the amount of variance for standard cracking decreases by five orders of magnitude over the query sequence.
- 5.
Cracking parallelization The support of concurrency is crucial for performance on modern multi-core hardware. Therefore, the cracking algorithms must be extended to scale well with the available computing cores. As this is a challenging task, we will investigate this separately in Sect. 5.
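For reference, a common incremental formulation of the exponentially weighted moving average and the corresponding variance estimate with decay factor \(\alpha\) is
\[\mu_i = \mu_{i-1} + \alpha\,(x_i - \mu_{i-1}), \qquad \sigma^2_i = (1-\alpha)\,\bigl(\sigma^2_{i-1} + \alpha\,(x_i - \mu_{i-1})^2\bigr),\]
where \(x_i\) is the response time of query \(i\) and \(\alpha = 0.1\) as stated above. We assume this standard formulation here; the exact variant used in [6] may differ in details.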
2.5 Advanced cracking algorithms
Several follow-up works on cracking focussed on the key concerns in standard cracking. In this section, we revisit these advanced cracking techniques.
Predication and vectorized cracking: Decouple pivot comparison and physical reorganization by moving elements speculatively and correcting wrong decisions afterwards.
The technique of predication cracking [23] directly attacks a major problem in standard cracking: excessive branch misprediction leading to large amounts of unnecessarily executed code. In contrast to standard cracking, where pointers are moved and elements are swapped based on the outcome of comparing an element with the pivot, predication cracking speculatively performs these reorganizations and interleaves them with the comparison evaluations of pivot and elements. When the result of the comparison is available, the incorrectly applied reorganizations are corrected. To ensure that the speculative writing does not cause data loss, the overwritten elements are backed up separately. This concept makes the algorithm completely branch free, and thus, the misprediction penalties no longer exist. On the downside, the speculative writing adds an overhead compared to standard cracking. The question is now whether this trade-off can improve the runtime.
In predication cracking, the granularity at which data is backed up is fixed to a single element. Thus, the authors propose a generalization of the concept in the form of vectorized cracking, where data is backed up and partitioned in larger blocks of adjustable size. This further decouples the costly backing up of data from the actual partitioning.
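To illustrate the idea of speculative, branch-free reorganization, here is a simplified out-of-place partitioning kernel: every element is written to both candidate positions and only the correct write head advances, so no data-dependent branch is needed. This is only meant to convey the flavour of the approach; the actual predication and vectorized cracking algorithms of [23] work in place and back up the potentially overwritten elements separately.

```cpp
#include <cstddef>
#include <vector>

// Branch-free split of `in` around `pivot` into `out` (out must be at least as large as in).
// Lower values grow from the front, upper values from the back; the upper piece ends up in
// reverse input order, which is irrelevant for cracking.
std::size_t branchfree_partition(const std::vector<int>& in, std::vector<int>& out, int pivot) {
  std::size_t lo = 0, hi = in.size();
  for (int v : in) {
    out[lo] = v;                     // speculative write to the lower candidate slot
    out[hi - 1] = v;                 // speculative write to the upper candidate slot
    std::size_t isLower = (v < pivot);
    lo += isLower;                   // exactly one head advances; the wrong write
    hi -= 1 - isLower;               // is simply overwritten by a later element
  }
  return lo;                         // split position in `out`
}
```

A caller would allocate `out` with the same size as `in` before the call; the double write trades extra stores for the complete absence of data-dependent branches.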
Standard cracking in comparison with predication cracking and vectorized cracking. a Single query indexing time, b single crack indexing time on 16 B elements, c single crack indexing time on 4 B elements
CPU efficiency
Predication and vectorized cracking over query sequence
Hybrid cracking: Create unsorted initial runs which are physically reorganized and then adaptively merged for faster convergence.
Revisiting three advanced cracking algorithms. a Hybrid cracking, b sideways cracking, c stochastic cracking
Sideways cracking: Adaptively create, align, and crack every accessed selection-projection attribute pair for efficient tuple reconstruction.
Sideways cracking [15] uses cracker maps to address the issue of inefficient tuple reconstruction in standard cracking, as shown in Fig. 3c. A cracker map consists of two logical columns, the cracked column and a projected column, and it is used to keep the projection attributes aligned with the selection attributes. When a query comes in, sideways cracking creates and cracks only those cracker maps that contain any of the accessed attributes. As a result, each accessed column is always aligned with the cracked column of its cracker map. If the attribute access pattern changes, then the cracker maps may reflect different progressions with respect to the applied cracks. Sideways cracking uses a log to record the state of each cracker map and to synchronize them when needed. Thus, sideways cracking works without any workload knowledge and adapts cracker maps to the attribute access patterns. Further, it improves its adaptivity and reduces the amount of overhead by only materializing those parts of the projected columns in the cracker maps which are actually queried (partial sideways cracking).
We reimplemented sideways cracking as described above, except that we store cracker maps in row layout instead of column layout. We do so because the two columns in a cracker map are always accessed together and a row layout offers better tuple reconstruction. In addition to the cracked column and the projected column, each cracker map contains the rowIDs that map the entries into the base table as well as a status column denoting which entries of the projected column are materialized. Figure 7b shows the performance of sideways cracking in comparison. In this experiment, the methods have to project one attribute while selecting on another. In addition to the unclustered version of full index, we also show the clustered version (clustered full index). We can see that sideways cracking outperforms all unclustered methods after about 100 queries and approaches the query response time of clustered full index. Thus, sideways cracking offers efficient tuple reconstruction.
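As an illustration of this row layout, one entry of such a cracker map could look as follows; the field types and names are our own assumptions, not those of the original implementation.

```cpp
#include <cstdint>
#include <vector>

// One row-layout entry of a cracker map: the cracked key, one projected attribute,
// the rowID into the base table, and a status flag marking whether the projected
// value has already been materialized (partial sideways cracking).
struct CrackerMapEntry {
  int32_t  key;           // selection attribute (cracked)
  int32_t  projected;     // projected attribute, valid only if materialized == true
  uint32_t rowId;         // position of the tuple in the base table
  bool     materialized;  // "status column" of partial sideways cracking
};

// One cracker map per accessed (selection, projection) attribute pair;
// cracking swaps whole entries so key and projected value stay aligned.
using CrackerMap = std::vector<CrackerMapEntry>;
```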
Stochastic cracking: Create more balanced partitions using auxiliary random pivot elements for more robust query performance.
Stochastic cracking [11] addresses the issue of performance unpredictability in database cracking, as seen in Fig. 3d. A key problem in standard cracking is that the partition boundaries depend heavily on the incoming query ranges. As a result, skewed query ranges can lead to unbalanced partition sizes and successive queries may still end up rescanning large parts of the data. To reduce this problem, stochastic cracking introduces additional cracks apart from the query-driven cracks at query time. These additional cracks help to evolve the cracker index in a more uniform manner. Stochastic cracking proposes several variants to introduce these additional cracks, including data driven and probabilistic decisions. By varying the amount of auxiliary work and the crack positions, stochastic cracking manages to introduce a trade-off situation between variance on one side and cracking overhead on the other side. We reimplemented the MDD1R variant of stochastic cracking, which showed the best overall performance in [11]. In this variant, the partitions in which the query boundaries fall are cracked by performing exactly one random split. Additionally, while performing the random split, the result of each partition at the boundary of the queried range is materialized in a separate view. At query time, the result is built by reading the data of the boundary partitions from the views and the data of the inner part from the index. Figure 7c shows the MDD1R variant of stochastic cracking. We can see that stochastic cracking (MDD1R) behaves very similar to standard cracking, although the query response times are overall slower than those of standard cracking. As the uniform random access pattern creates balanced partitions by default, the additional random splits introduced by stochastic cracking (MDD1R) do not have an effect. We will come back to stochastic cracking (MDD1R) with other access patterns in Sect. 4.5.
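The random split of a boundary piece, combined with materializing the qualifying values into a view, can be sketched as follows. This is our simplified reading of the description above; the actual MDD1R implementation in [11] differs in details (e.g. rowID handling and view management).

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

struct SplitResult {
  std::size_t crackPos;    // first position of the upper piece after the random split
  std::vector<int> view;   // values of this piece that qualify for [low, high)
};

// Crack the piece data[lo..hi] (inclusive) at a randomly chosen pivot and, while
// scanning, materialize the values qualifying for the query range [low, high).
SplitResult mdd1r_split(std::vector<int>& data, std::size_t lo, std::size_t hi,
                        int low, int high) {
  int pivot = data[lo + static_cast<std::size_t>(rand()) % (hi - lo + 1)];  // data-driven random pivot
  SplitResult res;
  long L = static_cast<long>(lo), R = static_cast<long>(hi);
  while (L <= R) {
    if (data[L] < pivot) {
      if (data[L] >= low && data[L] < high) res.view.push_back(data[L]);
      ++L;
    } else if (data[R] >= pivot) {
      if (data[R] >= low && data[R] < high) res.view.push_back(data[R]);
      --R;
    } else {
      std::swap(data[L], data[R]);   // both misplaced; the branches above handle them next
    }
  }
  res.crackPos = static_cast<std::size_t>(L);
  return res;
}
```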
2.6 Cracking classification
Classification of cracking algorithms along three dimensions: the number of split lines introduced at a time, the strategy used to pick the split lines, and the time at which the splits are performed
- 1.
Zero: The trivial case is when a method introduces no split line and each query performs a full scan.
- 2.
Few: Most cracking algorithms introduce a few split lines at a time. For instance, standard cracking introduces either one or two split lines for each incoming query. Similarly, sideways cracking introduces split lines in all accessed cracker maps.
- 3.
Several: Cracking algorithms can also introduce several split lines at a time. For example, hybrid crack sort may introduce several thousand initial partitions and introduce either one or two split lines in each of them, thus generating several split lines in total.
- 4.
All: The extreme case is to introduce all possible split lines, i.e. fully sort the data. For example, hybrid crack sort fully sorts the final partition, i.e. introduces all split lines in the final partition.
- 1.
Query based: The standard case is to pick the split lines based on the selection predicates in the query, i.e. the low and high keys in the query range.
- 2.
Data based: We can also split data without looking at a query. For example, full sorting creates split lines based only on the data.
- 3.
Random: Another strategy is to pick the split lines randomly as in stochastic cracking.
- 4.
None: Finally, the trivial case is to not have any split strategy, i.e. do not split the data at all and perform full scan for all queries.
- 1.
Upfront: A cracking algorithm could perform the splits before answering any queries. Full indexing falls in this category.
- 2.
Per Query: All cracking algorithms we discussed so far perform splits when seeing a query.
- 3.
Never: The trivial case is to never perform the splits, i.e. fully scanning the data for each query.
3 Extending cracking algorithms
In this section, we discuss the weaknesses in the advanced cracking algorithms and evaluate possible directions on how to improve them.
3.1 Improving cracking convergence
Comparing convergence of standard cracking, hybrid cracking and buffered swapping. a Convergence speed towards full index, b influence on swap count, c influence on query response time
Buffered swapping: Instead of swapping elements immediately after identification by the cracking algorithm, insert them into heaps and flush them back into the index as sorted runs.
Let us look at the crack-in-two operation in hybrid cracking. Recall that the crack-in-two operation scans the dataset from both ends until we find a pair of entries which need to be swapped (i.e. they are in the wrong partitions). This pair is then swapped, and the algorithm continues its search until the pointers meet. Note that there is no relative ordering between the swapped elements and they may end up getting swapped again in future queries, thus penalizing them over and over again. We can improve this by extending the crack-in-two operation to buffer the elements identified as swap pairs, i.e. buffered crack-in-two. Buffered crack-in-two collects the swap pairs in two heaps: a max-heap for values that should go to the upper partition and a min-heap for values that should go to the lower partition. In addition to the heap structures, we maintain two queues to store the empty positions in the two partitions. The two heaps keep the elements in order, and when the heaps are full, we move the top elements of the two heaps to the next available empty positions. This process is repeated until no more swap pairs can be identified and the heaps are empty. As a result of the heap ordering, the swapped elements retain a relative ordering in the index after each cracking step. This ordering even holds for entries that were never in the heap at the same time but shared presence with a third element, since a transitive relationship is established. Elements ordered in this process are never swapped again in future queries, and thus, the number of swaps is reduced. The above approach of buffered crack-in-two is similar to [21], where two heaps are used to improve the stability of the replacement selection algorithm. By adjusting the maximal heap size in buffered crack-in-two, we can tune the convergence speed of the cracked index. A larger heap size results in larger sorted runs. However, larger heaps incur a higher overhead to keep their data sorted. In the extreme case, a heap size equal to the number of (swapped) elements results in full sorting, while a heap size of 1 falls back to standard crack-in-two. Of course, buffered crack-in-two can be embedded in any method that uses the standard crack-in-two algorithm. To separate it from the remaining methods, we integrate it into a new technique called buffered swapping that is a mixture of buffered and standard crack-in-two. For the first n queries, buffered swapping uses buffered crack-in-two. After that, buffered swapping switches to standard crack-in-two. Figure 8b shows the number of swaps in standard cracking, hybrid crack sort, and buffered swapping over 1000 queries. In order to make them comparable, we force all methods to use only crack-in-two operations. For buffered swapping, we vary the number of buffered queries \(n_b\) along the X axis, i.e. the first \(n_b\) queries perform buffered swapping, while the remaining queries still perform the standard crack-in-two operation. We vary the maximal heap size from \(100\,\hbox {K}\) to \(10\,\hbox {M}\) entries. From Fig. 8b, we can see that the number of swaps decreases significantly as \(n_b\) varies from 1 to 1000. Compared to standard cracking, buffered swapping saves about 4.5 million swaps for 1 buffered query and 73 million swaps for 1000 buffered queries and a heap size of \(1\,\hbox {M}\). The larger the maximal heap size, the larger the reduction in swaps.
Furthermore, we can observe that the swap reduction for 1000 buffered queries improves only marginally over that of 100 buffered queries. This indicates that after 100 buffered queries the cracked column is already close to being fully sorted. Hybrid cracking performs even more swaps than standard cracking (including moving the qualifying entries from the initial partitions to the final partition).
Next let us see the actual runtimes of buffered swapping in comparison with standard cracking and hybrid crack sort. Figure 8c shows the result. We see that the total runtime grows rapidly as the number of buffered queries (\(n_b\)) increases. However, we also see that the query time after performing buffered swapping improves. For example, after performing buffered swapping with a maximal heap size of \(1\,\hbox {M}\) for just 10 queries, the remaining 990 queries are 1.8 times faster than hybrid crack sort and even \(5.5\,\%\) faster than standard cracking. This shows that buffered swapping helps to converge better by reducing the number of swaps in subsequent queries. Interestingly, a larger buffer size does not necessarily imply a higher runtime. For 100 and 1000 buffered queries, the buffered part is faster for a maximum heap size of \(10\,\hbox {M}\) entries than for smaller heaps. This is because such a large heap size leads to an earlier convergence towards the full sorting. Nevertheless, the high runtime of initial buffer swapped queries is a concern. In our experiments, we implemented buffered swapping using the gheap implementation [7] with a fan-out of 4. Each element that is inserted into a gheap has to sink down inside of the heap tree to get to its position. This involves pairwise swaps and triggers many cache misses. Exploring more efficient buffering mechanisms in detail opens up avenues for future work.
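To make the buffering idea concrete, the following is a minimal sketch of buffered crack-in-two on a single integer column; rowIDs, the cracker index update, and the gheap-based implementation used in our experiments are omitted, and the interface is our own simplification.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Reorganize data[lo..hi] (inclusive) so that values < pivot precede values >= pivot.
// Misplaced values are buffered in two bounded heaps and written back as sorted runs
// instead of being swapped pairwise, so they keep a relative order for future queries.
void buffered_crack_in_two(std::vector<int>& data, std::size_t lo, std::size_t hi,
                           int pivot, std::size_t maxHeapSize) {
  std::priority_queue<int> toUpper;  // max-heap: misplaced values >= pivot
  std::priority_queue<int, std::vector<int>, std::greater<int>> toLower;  // min-heap: values < pivot
  std::queue<std::size_t> holesLower, holesUpper;  // empty slots in the lower/upper partition

  auto flushOne = [&]() {
    // Largest "upper" value goes to the rightmost hole, smallest "lower" value to the
    // leftmost hole; repeated flushing therefore produces ascending runs on both sides.
    data[holesUpper.front()] = toUpper.top(); toUpper.pop(); holesUpper.pop();
    data[holesLower.front()] = toLower.top(); toLower.pop(); holesLower.pop();
  };

  std::size_t L = lo, R = hi;
  while (true) {
    while (L <= R && data[L] < pivot) ++L;         // skip correctly placed low values
    while (R > L && data[R] >= pivot) --R;         // skip correctly placed high values
    if (L >= R) break;
    toUpper.push(data[L]); holesLower.push(L);     // hole at L receives a low value later
    toLower.push(data[R]); holesUpper.push(R);     // hole at R receives a high value later
    ++L; --R;
    if (toUpper.size() >= maxHeapSize) flushOne(); // bounded heaps: flush when full
  }
  while (!toUpper.empty()) flushOne();
}
```

Setting maxHeapSize to 1 essentially degenerates to standard crack-in-two, while a heap as large as the number of swapped elements yields fully sorted pieces, matching the trade-off discussed above.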
3.2 Improving tuple reconstruction
Comparing tuple reconstruction cost of standard, sideways, and covered cracking. a Varying number of projected attributes for sideways cracking, b varying number of projected attributes for covered cracking, c covering trade-off for tuple reconstruction
Covered cracking: Group multiple non-key attributes with the cracked column in a cracker map. At query time, crack all covered non-key attributes along with the key column for even more efficient tuple reconstruction.
Note that with sideways cracking all projected columns are aligned with each other. However, the query engine still needs to fetch the projected attribute values from different columns in different cracker maps. These lookup costs turn out to be very expensive, in addition to the overhead of cracking n replicas of the indexed column for n projected attributes. To solve this problem, we generalize sideways cracking to cover the n projected attributes in a single cracker map. In the following, we term this approach covered cracking. While cracking, all covered attributes of a cracker map are reorganized with the cracked column. As a result, all covered attributes are aligned and stored in a consecutive memory region, i.e. no additional fetches are involved if the accessed attribute is covered. However, the drawback of this approach is that we need to specify which attributes to cover. To be on the safe side, we may cover all table attributes. However, this means that we will need to copy the entire table for indexing. We can think of fixing this by adaptively covering the cracked column, i.e. not copying the covered attributes upfront but rather on demand when they are accessed. An option is to copy the covered attribute columns when they are accessed for the first time. An even more fine-granular approach is to copy only the accessed values of covered attributes, thus reflecting the query access pattern in the covering status. Figure 9b shows the performance of covered cracking over different numbers of projected attributes. Here we show the results from covered cracking which copies the data of all covered attributes in the beginning. We can see that covered cracking remains stable when varying the number of projected attributes from 1 to 10. Thus, covered cracking scales well with the number of attributes. Figure 9c compares the accumulated costs of standard, sideways, and covered cracking. We can see that while the accumulated costs of standard and sideways cracking grow linearly with the number of attributes, the accumulated costs of covered cracking remain pegged at under 40 s. We also see that sideways cracking outperforms covered cracking only for up to 4 projected attributes. For more than 4 projected attributes, sideways cracking becomes increasingly expensive, whereas covered cracking remains stable. Thus, we see that covering offers huge benefits.
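Extending the row-layout entry sketched earlier for sideways cracking, a covered cracker map entry simply stores all covered attributes inline; again, the concrete types and names are our own assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One row-layout entry of a covered cracker map with N covered attributes.
// Cracking swaps whole entries, so all covered values stay aligned with the key
// and qualifying tuples can be reconstructed without touching the base table.
template <std::size_t N>
struct CoveredEntry {
  int32_t key;                      // cracked (selection) attribute
  uint32_t rowId;                   // position in the base table (for uncovered attributes)
  std::array<int32_t, N> covered;   // covered (projection) attributes, stored inline
};

template <std::size_t N>
using CoveredCrackerMap = std::vector<CoveredEntry<N>>;
```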
3.3 Improving cracking robustness
In this section, we look at how well stochastic cracking [11] addresses the issue of query robustness and whether we can improve upon it. In Fig. 7c, we can observe that stochastic cracking is more expensive (for first as well as subsequent queries) than standard cracking. On the other hand, the random splits in stochastic cracking (MDD1R) create uniformly sized partitions. Thus, stochastic cracking trades performance for robustness. So the key question now is: Can we achieve robustness without sacrificing performance? Can we have high robustness and efficient performance at the same time?
Coarse-granular index: Create balanced partitions using range partitioning upfront for more robust query performance. Apply standard cracking later on.
Stochastic cracking successively refines the accessed data regions into smaller equal-sized partitions, while the non-accessed data regions remain as large partitions. As a result, when a query touches a non-accessed data region, it still ends up shuffling large portions of the data. To solve this problem, we extend stochastic cracking to create several equal-sized partitions upfront, i.e. we pre-partition the data into smaller range partitions. With such a coarse-granular index, we shuffle data only inside a range partition, and thus, the shuffling costs stay within a threshold. Note that in standard cracking, the initial queries have to read huge amounts of data anyway, without gathering any useful knowledge. In contrast, the coarse-granular index moves some of that effort to a prepare step to create meaningful initial partitions. As a result, the cost of the first query is slightly higher than in standard cracking but still significantly less than full indexing. With such a coarse-granular index, users can choose to allow the first query to be a bit slower and witness stable performance thereafter. Also, note that the first query in standard cracking is slower than a full scan anyway, since it partitions the data into three parts. Coarse-granular index differs from standard cracking in that it allows for creating any number of initial partitions, not necessarily three. Furthermore, by varying the number of initial partitions, we can trade the initialization time for more robust query performance. This means that, depending upon their application, users can adjust the initialization time in order to achieve a corresponding robustness level. This is important in several scenarios in order to achieve customer SLAs. In the extreme case, users can create as many partitions as the number of distinct data items. This results in a full index, has a very high initialization time, and offers the most robust query performance. The other extreme is to create only a single initial partition. This is equivalent to the standard cracking scenario, i.e. very low initialization time and least robust query performance. Thus, coarse-granular index covers the entire robustness spectrum between standard cracking and full indexing.
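A minimal sketch of such an upfront range partitioning is shown below: a two-pass, out-of-place histogram-and-scatter step that produces k equal-width partitions whose boundaries seed the cracker index. We assume uniformly distributed keys in [0, maxKey], as in our experimental set-up; the actual implementation additionally handles rowIDs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Partition `data` into k contiguous equal-width key ranges and return the k+1
// partition boundaries (start offsets); these become the initial cracker index entries.
std::vector<std::size_t> range_partition(std::vector<int>& data, int maxKey, std::size_t k) {
  auto bucket = [&](int v) {
    return std::min(k - 1, static_cast<std::size_t>(
        (static_cast<long long>(v) * static_cast<long long>(k)) /
        (static_cast<long long>(maxKey) + 1)));
  };
  std::vector<std::size_t> hist(k, 0), start(k + 1, 0);
  for (int v : data) ++hist[bucket(v)];                            // pass 1: histogram
  for (std::size_t i = 0; i < k; ++i) start[i + 1] = start[i] + hist[i];
  std::vector<int> out(data.size());
  std::vector<std::size_t> write(start.begin(), start.end() - 1);  // per-partition write heads
  for (int v : data) out[write[bucket(v)]++] = v;                  // pass 2: scatter
  data.swap(out);
  return start;
}
```

With k = 1 this degenerates to standard cracking's starting point, while a very large k approaches a full index, mirroring the robustness spectrum described above.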
Comparing robustness of standard cracking, stochastic cracking, coarse-granular index, and full index. a Variance in response time, b initialization time trade-off, c extending Fig. 9c by nearby clustering
We can also extend the coarse-granular index and pre-partition the base table along with the cracker column. This means that we range partition the source table in exactly the same way as the adaptive index during the initialization step. However, we still refine only the cracked column for each incoming query; the source table is left untouched. If the partition is small enough to fit into the cache, then the tuple reconstruction costs are negligible, as there are no cache misses. Essentially, we decrease the physical distance between external random accesses, i.e. the index entry and the corresponding tuple are nearby clustered. This increases the likelihood that tuple reconstruction does not incur any cache misses. Thus, as a result of pre-partitioning the source table, we can achieve more robust tuple reconstruction without covering the adaptive index itself, as in covered cracking in Sect. 3.2. However, we need to pick the partition size judiciously. Larger partitions do not fit into the cache, while smaller partitions result in high initialization time. Note that if the data is stored in row layout, then the source table is scanned completely during index initialization anyway, and so pre-partitioning is not too expensive. Furthermore, efficient tuple reconstruction using nearby clustering is limited to one index per table, the same as for all primary indexes.
Scalability under data size increase by factor 10
Factor slower (from 100 M to 1 B) | Initialization | Remaining | TOTAL |
---|---|---|---|
Standard cracking | 10.01 | 9.92 | 9.93 |
Stochastic cracking (MDD1R) | 12.92 | 9.57 | 9.75 |
Coarse-granular index 10 | 11.73 | 9.92 | 10.56 |
Coarse-granular index 100 | 11.72 | 9.81 | 10.79 |
Coarse-granular index 1 K | 11.69 | 9.96 | 11.09 |
Coarse-granular index 10 K | 11.31 | 9.94 | 10.95 |
Coarse-granular index 100 K | 10.90 | 10.02 | 10.73 |
Full index | 11.48 | 9.97 | 11.29 |
Sideways cracking | – | – | 11.92 |
Covered cracking | – | – | 9.98 |
Coarse-granular index 100 (nearby clustered) | – | – | 11.64 |
Coarse-granular index 1 K (nearby clustered) | – | – | 13.33 |
Classification of extended cracking algorithms
4 Extending cracking experiments
In this section, we compare cracking with different sort and index baselines in detail. Our goal here is to understand how good or bad cracking is in comparison with different full indexing techniques. In the following, we first consider different sort algorithms, then different index structures, and finally the effect of query selectivity.
4.1 Extending sorting baselines
The typical baseline used in previous cracking works was a full index wherein the data is fully sorted using quick sort and queries are processed using binary search to find the qualifying tuples. Sorting is an expensive operation, and as a result, the first fully sorted query is up to 30 times slower than the first cracking query (See Fig. 2b). So let us consider different sort algorithms.
Comparing standard cracking with different sort and index baselines. a Comparing different sort algorithms, b indexing effort of diff. indexes, c per-query response time of diff. indexes
4.2 Extending index baselines
Let us now consider different index structures and contrast them with simple binary search on sorted data. The goal is to see whether or not it makes sense to use a sophisticated index structure as a baseline for cracking. We consider three index structures: (1) AVL-tree [1], (2) B+-tree [3], and (3) the very recently proposed ART [20]. We choose ART since it outperforms other main memory optimized search trees such as CSB+-tree [24] and FAST [19].
Let us first see the total indexing effort of different indexing methods over 1000 queries. For binary search, we simply sort the data (radix_insert sort), while for other full indexing methods (i.e. AVL-tree, B+-tree, and ART), we load the data into an index structure in addition to sorting (radix_insert sort). Standard cracking distributes its indexing effort over the 1000 queries, while the remaining methods perform their sorting and indexing work in the first query. For the B+-tree, we present two different variants: one that is bulk loaded and one that is tuple-wise loaded. Figure 11b shows the results. We can see that AVL-tree is the most expensive, while standard cracking is the least expensive in terms of indexing effort. The indexing efforts of binary search and B+-tree (bulk loaded) are very close to standard cracking. However, the tuple-wise loaded B+-tree as well as ART spend more indexing effort, since both of them load the index tuple by tuple. The key thing to note here is that bulk loading an index structure adds only a small overhead to the pure sorting. Let us now see the query performance of the different index structures. Figure 11c shows the per-query response times of different indexing methods. Surprisingly, we see that using a different index structure has barely any impact on query performance. This is contrary to what we expected, so let us understand this in more detail in the following.
4.3 Effect of varying selectivity
Comparing standard cracking with index baselines while varying selectivity (note that: \((\mathbf{a}) = (\mathbf{b}) + (\mathbf{c})\)). a Accumulated query response time, b accumulated indexing time, c acc. index lookup + data access time
Figure 12b shows the accumulated indexing time for different methods when varying selectivity. Obviously, the indexing time is constant for all full indexing methods. However, the total indexing time of standard cracking changes with varying query selectivity. In fact, the indexing effort of standard cracking decreases by \(45\,\%\) when the selectivity changes from \(10^{-5}\) to \(10^{-1}\). As a result, the indexing effort of standard cracking surpasses even the effort of binary search (by more than \(18\,\%\)) and of the B+-tree (bulk loaded) (by more than \(5\,\%\)), both based on radix_insert sort, for as few as 1000 queries. The reason standard cracking depends on selectivity is that with high selectivity the two partition boundaries of a range query are located close together and the index refinement per query is small. As a result, several data items are shuffled over and over again. This increases the overall indexing effort as well as the time to converge to a full index.
Figure 12c shows the accumulated index lookup and data access costs of different methods over varying selectivity. We can see that the difference in the querying costs of different methods grows for higher selectivity. For instance, AVL-tree is more than 5 times slower than ART for a selectivity of \(10^{-8}\). We also see that standard cracking is the most lightweight method in terms of the index lookup and data access costs and is closely followed by ART. However, for high selectivities, the index lookup and data access costs are small compared to the indexing costs. As a result, the difference in the index lookup and data access costs of different methods is not reflected in the total costs in Fig. 12a.
To conclude, the take-away message from this section is threefold: (1) using a better index structure makes sense only for very high selectivities, e.g. one in a million, (2) cracking depends on query selectivity in terms of indexing effort, and (3) although cracking creates the indexes adaptively, it still needs to catch up with full indexing in terms of the quality of the index.
4.4 Effect of varying cracking depth
Stopping cracking at a certain partition size
4.5 Effect of query access pattern
Generated predicates for different access pattern. a Sequential pattern, b skewed pattern
Effect of query access pattern on adaptive methods. a Sequential access pattern, b skewed access pattern, c accumulated query response times
Finally, let us investigate how the methods perform under a skewed access pattern. We create the skewed access pattern in the following way: first, a Zipfian distribution is generated over n values, where n corresponds to the number of queries. Based on that distribution, the domain around the hotspot, which is the middle of the domain in our case, is divided into n parts. After that, the query predicates are set according to the frequencies in the distribution. The k values with the \(l\text{-th}\) highest frequency in the distribution lead to k query predicates lying in the \(l\text{-th}\) nearest area around the hotspot. Figure 14b shows the generated predicates for \(\alpha = 2.0\). These predicates are randomly shuffled before they are used in the query sequence. Figure 15b shows the query response time for the skewed pattern. We can observe a high variance in all methods except coarse-granular index. Between accessing the hotspot area and regions that are far off, the query response time varies by almost 3 orders of magnitude. Early on, all methods index the hotspot area heavily, as most query predicates fall into that region. Stochastic cracking manages to reduce the negative impact of predicates that are far off the hotspot area. However, it is slower than standard cracking if the hotspot area is hit. Hybrid crack sort copies the hotspot area early on to its final partition and exhibits the fastest query response times in the best case. However, if a predicate requests a region that has not been queried before, copying from the initial partitions to the final partition is expensive.
Finally, Fig. 15c shows the accumulated query response time for both sequential and skewed access patterns. Obviously, handling sequential patterns is challenging for all adaptive methods; especially hybrid crack sort suffers from large repetitive scans in all initial partitions and is therefore by far the slowest method in this scenario. Stochastic cracking (MDD1R) manages to reduce the problems of standard cracking significantly and fulfils its purpose by providing workload-robust query answering. In total, coarse-granular index is the fastest method under this pattern. Overall, for the skewed access pattern, the difference between the methods is significantly smaller than for the sequential pattern.
5 Parallelizing cracking algorithms
All single-threaded algorithms evaluated in this paper
Algorithm | References |
---|---|
Standard cracking | [17] |
Predication/vectorized cracking | [23] |
Hybrid crack/radix/sort sort | [16] |
Buffered swapping | This paper resp. [25] |
Stochastic cracking (MDD1R) | [11] |
Coarse-granular index | This paper resp. [25] |
Sideways cracking | [15] |
Covered cracking | This paper resp. [25] |
Sorting: Quick(_insert) sort, Radix(_insert) sort | – |
Full index: AVL-tree, B+-tree, ART | [1], [3], [20] |
5.1 Parallel cracking methods
All multi-threaded algorithms evaluated in this paper
Algorithm | Abbreviation | References |
---|---|---|
Parallel standard cracking | P-SC | [8, 9] |
Parallel coarse-granular index | P-CGI | [2] |
Parallel-chunked standard cracking | P-CSC | [2] |
Parallel-chunked vectorized cracking | P-CVC | Variant of [23] |
Parallel-chunked coarse-granular index | P-CCGI | [2] |
Parallel range-partitioned radix sort | P-RPRS | [2] |
Parallel-chunked radix sort | P-CRS | [2] |
Parallel sideways cracking (with P-CSC) | P-SW-CSC | This paper |
Parallel sideways cracking (with P-CCGI) | P-SW-CCGI | This paper |
Parallel range-partitioned radix sort (cluster complete) | P-PC-RPRS | This paper |
Parallel range-partitioned radix sort (cluster lazy) | P-LC-RPRS | This paper |
Parallel standard cracking (P-SC): Interleave answering of multiple queries in isolation by serializing crack-in-two on the granularity of partitions.
A very natural form of inter-query parallelism is realized in parallel standard cracking [8, 9], denoted as P-SC from here on. It is based on the observation that a query modifies at most two partitions of the cracker column. Thus, if we want to execute multiple queries at the same time on the same cracker column, all we have to do is to serialize the cracking of partitions. To do so, the authors of [8, 9] introduce read and write locks on the partition level. An incoming query, running in its own thread, tries to acquire (at most two) write locks for the border partitions that it has to crack and read locks for the inner partitions. Since acquiring write locks is exclusive, only one query at a time can modify a certain partition. Similar to the single-threaded case, we can apply the concept of coarse-granular index to P-SC as well. To do so, the first query applies the lock-free parallel range-partitioning algorithm we used in our study on parallel cracking algorithms [2] before starting the actual query answering. We will refer to this method as P-CGI.
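A minimal sketch of the per-partition locking just described is given below; the data structures, the lock acquisition order, and the coupling with the cracker index are our own assumptions and differ from the actual implementation in [8, 9].

```cpp
#include <cstddef>
#include <memory>
#include <shared_mutex>
#include <vector>

struct Partition {
  std::shared_mutex lock;
  // piece boundaries, values, ... omitted
};

// Acquire the locks for the partitions [first, last] touched by one query, in increasing
// order to avoid deadlocks (assumption): exclusive locks on the two border partitions
// that get cracked, shared locks on the fully covered inner partitions.
void answer_query(std::vector<std::unique_ptr<Partition>>& parts,
                  std::size_t first, std::size_t last) {
  for (std::size_t i = first; i <= last; ++i) {
    bool border = (i == first || i == last);
    if (border) parts[i]->lock.lock();          // exclusive: this partition gets cracked
    else        parts[i]->lock.lock_shared();   // shared: inner partition is only read
  }
  // ... crack the border partitions, read the inner ones, build the result ...
  for (std::size_t i = last + 1; i-- > first;) {
    bool border = (i == first || i == last);
    if (border) parts[i]->lock.unlock();
    else        parts[i]->lock.unlock_shared();
  }
}
```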
Parallel coarse-granular index (P-CGI): Apply a parallel range partitioning to bulk-load the cracker index before starting the query answering using P-SC.
Obviously, for P-SC and P-CGI, the degree of parallelism highly depends on the current cracking state of the cracker column and on the query access pattern. Thus, the following algorithm, which we introduced in [2], implements the parallelism inside the answering of a single query to create a more stable parallel execution over the query sequence.
Parallel-chunked standard cracking (P-CSC): Divide the cracker column non-semantically into independent chunks and apply standard cracking on each chunk in parallel.
The concept of parallel-chunked standard cracking [2], denoted P-CSC from here on, is as simple as it is effective. We divide the column logically into multiple chunks and treat each chunk as a separate cracker column with its own cracker index. The incoming queries are executed sequentially within the query burst, while each individual query is evaluated on all chunks in parallel. Thus, each chunk is cracked individually and produces a local query result. When all threads finish the evaluation of the query locally, the global result can be computed. As there is almost no communication or synchronization necessary during cracking and result computation, this method naturally offers a high degree of parallelism from the very first query on. The same concept has been used in [23] in combination with vectorized cracking. Thus, we also test a vectorized version, denoted as P-CVC from here on. Of course, the concept of work division can be applied to more advanced cracking algorithms as well. Since our coarse-granular indexing method offers an interesting alternative to the standard version, we also test our chunked implementation of parallel coarse-granular index that we first introduced in [2].
Parallel-chunked coarse-granular index (P-CCGI): Divide the cracker column non-semantically into multiple independent chunks and apply coarse-granular index on each chunk in parallel. Then, apply standard cracking locally for the query answering.
The concept remains the same. The only difference to P-CSC is an initial step of range partitioning within each chunk as performed by the single-threaded coarse-granular index. After that, the single-threaded standard cracking is used in each chunk for the local result computation. We will call this method from here on P-CCGI [2].
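The following is a minimal sketch of this chunking idea, shared by P-CSC and P-CCGI: one chunk per thread, each with its own cracker state, and a barrier after every query. The helper crack_and_count only stands in for the per-chunk work; all names and interfaces are our own assumptions.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Placeholder for the per-chunk work: a real implementation would first crack the chunk
// on low and high (or refine its coarse-granular partitions) and then count or collect
// only within the qualifying piece.
uint64_t crack_and_count(std::vector<int>& chunk, int low, int high) {
  uint64_t n = 0;
  for (int v : chunk) n += (v >= low && v < high);
  return n;
}

// Answer one query on all chunks in parallel and combine the local results.
uint64_t parallel_chunked_query(std::vector<std::vector<int>>& chunks, int low, int high) {
  std::vector<uint64_t> local(chunks.size(), 0);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < chunks.size(); ++i)
    workers.emplace_back([&, i] { local[i] = crack_and_count(chunks[i], low, high); });
  for (auto& t : workers) t.join();   // barrier: the next query starts only after all chunks are done
  return std::accumulate(local.begin(), local.end(), uint64_t{0});
}
```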
In comparison with the different multi-threaded cracking versions, we test our two parallel radix-based sorting methods from [2]. The first version, called P-RPRS, first applies a parallel range partitioning and then sorts the partitions locally in parallel using a single-threaded radix sort. The second version, denoted P-CRS from here on, chunks the input non-semantically and then applies single-threaded range partitioning and radix sort on each chunk in parallel.
5.2 Hardware set-up
Stream benchmark with 60 threads. We can reach 65 GB/s per socket for the aggregated read/write bandwidth
5.3 Experimental set-up
Before we can start with any experimental evaluation, let us define the way in which the queries are fired and executed. As in previous experiments, we have a set of 1000 queries that should be answered as fast as possible. All queries are directly ready to be processed, and there is no artificial idle time introduced between queries. Depending on the type of the algorithms, this query batch is processed differently. For algorithms that perform inter-query parallelism, like P-SC for instance, we divide the set of queries into k parts, which are processed using k threads with each thread processing 1000/k queries sequentially. This resembles the way of firing queries in [8]. In contrast, for algorithms that perform intra-query parallelism, like P-CSC, the 1000 queries will be answered sequentially one after another. However, each individual query is answered by k threads in parallel on a portion of the data. Please note that to get a more realistic set-up, we introduced a barrier in the query execution loop: the answering of the next query starts only after all threads have completed the current one.
5.4 Scaling of parallel cracking algorithms
In Sect. 5.1, we described the set of algorithms for parallelizing database cracking. As mentioned before, many of these algorithms originate from our earlier study on parallel adaptive indexing [2]. In that work, we studied both the absolute runtimes and the scalability of the parallel cracking algorithms. In this paper, we revisit the scalability of parallel cracking algorithms in depth. To do this, we extend the parallel cracking experiments in two ways. First, in contrast to our previous study that used a low-end server with only 8 cores, we now use a massively parallel high-end machine consisting of 4 sockets with 60 physical and 120 logical cores (see Sect. 5.2 for a detailed description of the hardware). We believe that it is valuable to re-evaluate these techniques under a vastly different set-up to possibly gain new insights from them. Second, we dig into and analyse the performance of parallel cracking by looking at contention and bandwidth using Intel VTune Amplifier. Our goal is to understand and explain the scalability of parallel cracking algorithms in a massively parallel environment.
Besides the raw query processing times of different algorithms, parallel methods offer another important dimension to analyse: the capability to scale with the multi-threading resources of the hardware. An algorithm that scales poorly might be the winner in terms of runtime on a small machine, but completely loses pace on a large server. Therefore, in the following, we will inspect for each method individually how it scales with an increasing number of threads. We run each method using 4, 8, and 15 threads to utilize up to \(\frac{1}{4}\) of the machine's computing cores. Additionally, we test 30, 45, and 60 threads to investigate the scaling over the sockets. Finally, we run 120 threads to utilize all logical cores of the machine as well. We do not apply any form of thread pinning and let the operating system decide.
Figure 18 presents the accumulated speedups of the algorithms relative to their single-threaded counterparts. We inspect the individual parts (copying, range partitioning, sorting, query answering) of the methods to analyse them separately as well as the total speedup. Let us go through the methods one by one and analyse their scalability.
Visualization of the partition processing contention for 8 threads. A rectangle \([x_1,y_1,x_2,y_2]\) means that within the time from \(x_1\) to \(x_2\), a thread was processing the cracker column at the range \(y_1\) to \(y_2\). Processing also includes wait times to acquire a lock. A red square indicates a writing process (cracking a partition), while a green square visualizes a reading process (querying a partition). Overlapping squares indicate that multiple threads intend to work on the same area of the cracker column at the same time. a Parallel standard cracking (P-SC), b parallel coarse-granular index (P-CGI) (colour figure online)
Speedup of parallel cracking and sorting algorithms over their single-threaded counterparts while varying the number of threads. We show both the speedups of the characteristic phases as well as the overall achieved speedups. Coloured horizontal lines show the expected perfect linear speedup. In h, we show for the chunked algorithms how the chunking itself influences the methods by serially working the chunks. a Parallel standard cracking (P-SC), b parallel coarse-granular index (P-CGI), c parallel-chunked standard cracking (P-CSC), d parallel-chunked vectorized cracking (P-CVC), e parallel-chunked coarse-granular index (P-CCGI), f parallel range-partitioned radix sort (P-RPRS), g parallel-chunked radix sort (P-CRS), h serial execution of chunked algorithms (colour figure online)
Parallel coarse-granular index (P-CGI) As described before, this algorithm extends P-SC by applying an initial parallelized range-partitioning step that creates 1024 partitions right away before any query answering starts. This should have a positive effect on P-SC and significantly reduce the contention that we have measured before. Figure 17b presents again the partition processing contention, this time for P-CGI. The blank space between time 0 and 0.7 s is the range-partitioning phase. Afterwards, from 0.7 s until 1.2 s, we see the actual query answering, which indeed parallelizes nicely now. No heavy contention is visible anymore and the algorithm behaves as intended, as the partitions are already small and the chance of two threads accessing the same partition is small. This is confirmed by the scaling factors of the P-CGI query answering phase in Fig. 18b, which now reach a factor of 11\(\times \) for 45 threads. For more threads, the performance drops significantly, as access contention (both on the column and on the protected cracker index) throttles the throughput again. Figure 19a presents a query-wise view of the answering phase. We can see that directly in the first query, the scaling is still very limited. This is caused by the set-up and assignment of the threads to the tasks, which is expensive in comparison with the short running queries. Additionally, NUMA remote accesses are throttling the query answering phase. As the parallel range-partitioning algorithm creates partitions that are scattered across memory regions, a thread that answers a query consequently performs a large number of remote accesses. Table 6 shows that, based on hardware counters, 2 out of 3 accesses are remote for P-CGI. Let us now look at the range partitioning itself. For 120 threads, we achieve the best speedup, a factor of 15\(\times \). Memory bandwidth is clearly not the problem, as can be seen in the early phase of Fig. 19b, where only the histogram generation maximizes the bandwidth utilization. Our VTune analysis indicates that the range-partitioning algorithm is heavily back-end bound by the random nature of the partitioning. Advanced partitioning techniques like software-managed buffers and non-temporal streaming stores might improve upon this problem, as we investigate in a separate study on partitioning [26].
Parallel-chunked standard cracking (P-CSC) After looking at the inter-parallel version of standard cracking, let us now inspect the scaling behaviour of the intra-parallel version named P-CSC. The results are shown in Fig. 18c. We can see that this algorithm scales considerably better than the previous ones, which is what we expect from a method that parallelizes by chunking. However, we can also observe that the scaling is not linear in the number of threads. The highest total speedup that we achieve for 120 threads is only around 25\(\times \). To understand this behaviour, let us inspect the individual parts. Interestingly, the copying phase, which simply duplicates the input column into a separate array, scales particularly badly with a maximum speedup of 8\(\times \). As we can see from the bandwidth plot in Fig. 20a for 60 threads, the memory bus, which is only poorly utilized within the first 150 ms, is not the limiting factor. We identified page faults as the cause of this behaviour; they are surprisingly expensive to resolve when the cracker column is touched for the first time during the copying phase. Let us now see how the query answering part alone scales in P-CSC. In Fig. 18c, we see a maximal speedup of the query answering phase of 33\(\times \) for 60 threads, which is still not linear. NUMA effects are not a problem here: as we can see in Table 6, all accesses are local. Apparently, the scaling is limited from 45 threads on, so let us inspect the utilized bandwidth of the query answering phase for 30 threads (Fig. 20b), 45 threads (Fig. 20c), and 60 threads (Fig. 20d). We can see that in the early phase the bandwidth for 30 threads already reaches almost 59 GB/s, which is close to the cap of 65 GB/s, so we cannot expect linear scaling when increasing the number of threads by a factor of 1.5\(\times \) (45 threads) or 2\(\times \) (60 threads).
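To illustrate the chunking principle that underlies P-CSC and the other intra-parallel methods, the following is a strongly simplified sketch (not the tuned implementation evaluated here): each thread owns one chunk of the column together with a private cracker index, every query cracks all chunks in parallel, and joining the threads acts as the barrier at the end of the query. The helper names `crack_in_two`, `position_of`, and `query` are placeholders.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// One chunk of the column plus its private cracker index
// (here simply a map from cracked key to its position in the chunk).
struct Chunk {
  std::vector<uint64_t> data;
  std::map<uint64_t, size_t> index;
};

// Crack-in-two inside one piece: values < pivot go to the front.
// Returns the first position whose value is >= pivot.
size_t crack_in_two(std::vector<uint64_t>& d, size_t lo, size_t hi, uint64_t pivot) {
  while (lo < hi) {
    while (lo < hi && d[lo] < pivot) ++lo;
    while (lo < hi && d[hi - 1] >= pivot) --hi;
    if (lo < hi) std::swap(d[lo], d[hi - 1]);
  }
  return lo;
}

// Crack one chunk on `key`, reusing the pieces created by earlier queries.
size_t position_of(Chunk& c, uint64_t key) {
  auto it = c.index.find(key);
  if (it != c.index.end()) return it->second;
  size_t lo = 0, hi = c.data.size();
  auto next = c.index.lower_bound(key);                  // find the enclosing piece
  if (next != c.index.end()) hi = next->second;
  if (next != c.index.begin()) lo = std::prev(next)->second;
  size_t pos = crack_in_two(c.data, lo, hi, key);
  c.index[key] = pos;
  return pos;
}

// One query [low, high): all chunks are cracked in parallel,
// then the per-chunk counts are combined (the join is the barrier).
size_t query(std::vector<Chunk>& chunks, uint64_t low, uint64_t high) {
  std::vector<size_t> counts(chunks.size(), 0);
  std::vector<std::thread> threads;
  for (size_t t = 0; t < chunks.size(); ++t)
    threads.emplace_back([&, t] {
      size_t from = position_of(chunks[t], low);
      size_t to   = position_of(chunks[t], high);
      counts[t] = to - from;
    });
  for (auto& th : threads) th.join();
  return std::accumulate(counts.begin(), counts.end(), size_t{0});
}
```

In this simplified form, the per-query thread creation and the final join correspond to the set-up cost and the barrier discussed above; a production implementation would reuse a thread pool and pin each chunk to a NUMA-local thread.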
Bandwidth of P-CGI measured at 4 sockets in GB/s. The bottom line shows the CPU utilization in percent. a Query-wise scaling of parallel coarse-granular index (P-CGI) in the query answering phase, excluding the range-partitioning phase, b parallel coarse-granular index (P-CGI) with 45 threads. Highest bandwidth observed: 53.25 GB/s
From Fig. 18d, we can see that parallel-chunked vectorized cracking (P-CVC) shows a scaling behaviour very similar to that of P-CSC. It scales slightly worse than P-CSC, as it is even more bandwidth bound.
Parallel-chunked coarse-granular index (P-CCGI) Let us now inspect the intra-parallel version of the coarse-granular index in Fig. 18e. Interestingly, the range-partitioning phase scales almost exactly like that of P-CGI in Fig. 18b, although P-CGI uses a parallel range-partitioning implementation while P-CCGI simply chunks a single-threaded one. This shows again that the partitioning is heavily back-end bound and that stalls throttle the algorithm. The query answering phase, with 20\(\times \) for 60 threads, scales much better than that of P-CGI. One reason is that each chunk can be processed individually without any concurrency control except for the barrier at the end of each query. Another reason is the almost perfect NUMA locality that we can observe in Table 6.
Bandwidth measured at 4 sockets with Intel VTune Amplifier 2015 in GB/s. a Parallel-chunked standard cracking (P-CSC) with 60 threads. The initialization phase (copying the data into the cracker column in parallel) utilizes the bandwidth only partially (around 8 GB/s), b parallel-chunked standard cracking (P-CSC) with 30 threads without initialization phase. Highest bandwidth observed: 58.77 GB/s, c parallel-chunked standard cracking (P-CSC) with 45 threads without initialization phase. Highest bandwidth observed: 64.02 GB/s, d parallel-chunked standard cracking (P-CSC) with 60 threads without initialization phase. Highest bandwidth observed: 65.24 GB/s
Number of LLC cache misses served by local and remote DRAM accesses, respectively, in millions of measured events
Method | Local accesses (millions) | Remote accesses (millions) |
---|---|---|
P-SC | 107 | 418 |
P-CGI | 99 | 202 |
P-CSC | 442 | 0 |
P-CVC | 357 | 0.2 |
P-CCGI | 44 | 0.2 |
P-RPRS | 115 | 230 |
P-CRS | 365 | 0.6 |
Accumulated query response time of parallel cracking algorithms in comparison with parallel radix-based sorting methods. a 4 threads, b 15 threads, c 60 threads
5.5 Runtime of parallel cracking algorithms
After investigating the scaling capabilities of the algorithms individually, let us see how they compete against each other. To do so, we measure the accumulated query response time over 1000 queries and present the results for different thread configurations in Fig. 21. For 4 threads in Fig. 21a, a clear difference in runtime between the individual algorithms is visible. P-CSC has the lowest initialization time at almost 0.5 s, while the sorting methods need around 1.7 s for their first query, which is considerably more. Over 1000 queries, the cracking methods P-CCGI and P-CGI clearly win in terms of accumulated runtime, while P-SC falls far behind the remaining methods due to its serialization behaviour in the early querying phase. When increasing the number of threads to 15 in Fig. 21b and to 60 in Fig. 21c, we see a clear trend: the difference between the sorting and the cracking methods shrinks significantly. For 60 threads, the first query of P-CSC takes 191 ms, only 46 ms less than that of P-CRS, which fully sorts the chunks and answers the first query in 237 ms; this is caused by the superior scaling of the sort-based algorithms. This analysis indicates that for a large number of threads, the sorting algorithms are a clear alternative to the adaptive methods, especially since they are easier to integrate into the system stack and offer interesting orders. Nevertheless, we believe that in a real system, where many queries process several columns at the same time, only a portion of the physical resources is available to initialize a column. Under such circumstances, cracking retains its advantage of being by far the cheapest way to enable indexing.
5.6 Tuple reconstruction in the context of parallelism
Accumulated tuple reconstruction cost for 1000 queries and a table consisting of 10 columns, shown for 4 and 60 threads. We select on a single fixed attribute. In a and c, each query projects a single randomly chosen attribute. In b and d, each query projects five randomly selected attributes. a 4 threads: one \(\sigma \), one \(\pi \), b 4 threads: one \(\sigma \), five \(\pi \), c 60 threads: one \(\sigma \), one \(\pi \), d 60 threads: one \(\sigma \), five \(\pi \)
For all numbers of projected attributes, P-PC-RPRS behaves in the most predictable way. Clustering the entire table of 10 columns takes around 11 s, and the subsequent query answering adds only a small amount of time, even if 5 attributes are projected. P-LC-RPRS, which clusters a column when it is touched for the first time, is heavily affected by the number of projected attributes. Interestingly, the larger the number of projected attributes, the smaller the accumulated query response time. This makes sense, as a query projecting multiple attributes can cluster multiple columns in a single sorting run. We can also observe that lazy clustering pays off only for the first few queries, at least for a table consisting of only 10 columns. In comparison, both implementations of parallel sideways cracking offer a significantly smaller initialization time. The first query of P-SW-CCGI is slightly more expensive than that of P-SW-CSC, as it range partitions the dataset during the initialization of a cracker map. In the long run, prepending a range-partitioning step always clearly pays off. Overall, for 4 threads and 1000 queries, P-SW-CCGI shows the best accumulated runtime in all tested cases. This picture changes if we switch to 60 threads in Fig. 22c, d. All methods benefit from the increased number of threads; however, the sort-based methods benefit to a higher degree. The better scaling capabilities of the sort-based methods that we saw in the previous analysis pay off in the tuple reconstruction case as well. P-PC-RPRS needs fewer than 10 queries to beat both parallel sideways cracking implementations. The difference between the lazy P-LC-RPRS and P-PC-RPRS has also decreased significantly, and even P-LC-RPRS outperforms P-SW-CSC after around 40 queries for 1 projected attribute and after 8 queries for 5 projections. Overall, we see the same trend as before: the more threads are available, the more the advantage shifts towards the sorting side. Still, if only few threads are available for the initialization step, parallel sideways cracking shows a significantly smaller preparation time. Furthermore, for tables consisting of several hundred attributes, only on-demand initialization of columns, as offered by parallel sideways cracking, is a viable option.
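To see why tuple reconstruction dominates these experiments, consider the following simplified sketch of naive positional reconstruction (our own illustration, not the sideways-cracking implementation evaluated above): the cracker column stores key/row ID pairs, and every projected attribute then costs one random access per qualifying row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Entry of a cracker column that keeps the original row ID next to the key.
struct CrackerEntry { uint64_t key; uint32_t rowid; };

// Naive reconstruction: gather one projected attribute via row IDs
// for the qualifying range [from, to) of the cracker column.
std::vector<uint64_t> project(const std::vector<CrackerEntry>& cracker,
                              size_t from, size_t to,
                              const std::vector<uint64_t>& attribute) {
  std::vector<uint64_t> out;
  out.reserve(to - from);
  for (size_t i = from; i < to; ++i)
    out.push_back(attribute[cracker[i].rowid]);  // random, cache-unfriendly access
  return out;
}
```

Sideways cracking avoids exactly these random accesses by maintaining one cracker map per pair of selection and projected attribute and by reorganizing both columns together, while the clustering approaches P-PC-RPRS and P-LC-RPRS physically reorder the whole table, either up front or lazily.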
5.7 Skew in the context of parallelism
So far, both the input data and the query predicates were uniformly distributed. We now evaluate how skew affects the parallel methods and test the following three variants:
- 1.
The query predicates follow a normal distribution with mean \(\mu = 2^{63}\) (middle of the domain). The deviation is varied from \(\sigma = 2^{58}\) (high skew) to \(\sigma = 2^{62}\) (low skew). This pattern simulates a high interest in certain keys.
- 2.
The keys of the input data follow a normal distribution with mean \(\mu = 2^{63}\) (middle of the domain). The deviation is varied from \(\sigma = 2^{58}\) (high skew) to \(\sigma = 2^{62}\) (low skew). This pattern simulates a higher appearance frequency of certain keys.
- 3.
The keys of the input data follow a uniform distribution. However, the input is physically clustered into \(k\) uniform range partitions. We test low clustering using \(k=4\) and high clustering using \(k=60\). This pattern simulates data in which value ranges correlate with physical location, as is typically the case for sensor or financial data (a minimal data-generation sketch for all three variants follows this list).
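The following sketch shows how such inputs can be generated; the generator and its parameter names are our own and serve only to illustrate the three variants.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

std::mt19937_64 rng(42);

// Variants 1 and 2: normally distributed query predicates or input keys
// with mean 2^63 and a deviation sigma between 2^58 (high skew) and 2^62 (low skew).
uint64_t normal_key(double sigma) {
  const double mean = std::ldexp(1.0, 63);                       // 2^63
  const double max  = std::nextafter(std::ldexp(1.0, 64), 0.0);  // largest double < 2^64
  std::normal_distribution<double> dist(mean, sigma);
  return static_cast<uint64_t>(std::clamp(dist(rng), 0.0, max));
}

// Variant 3: uniformly distributed keys, physically clustered into k range partitions.
std::vector<uint64_t> clustered_input(size_t n, size_t k) {
  std::vector<uint64_t> data(n);
  std::uniform_int_distribution<uint64_t> uni;
  for (auto& v : data) v = uni(rng);
  std::sort(data.begin(), data.end());           // establish the global value order
  size_t part = (n + k - 1) / k;                 // then destroy it within each partition
  for (size_t p = 0; p < n; p += part)
    std::shuffle(data.begin() + p, data.begin() + std::min(n, p + part), rng);
  return data;
}
```

For example, `normal_key(std::ldexp(1.0, 58))` draws a highly skewed key or predicate, while `clustered_input(n, 60)` produces the highly clustered variant.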
Impact of the skewness variants on the methods for 60 threads. The numbers show the runtime under the respective skew variant relative to the runtime on the uniform random dataset with uniformly distributed query predicates. A number smaller than 1 thus means that the method runs faster under skew. a Skewed queries, b skewed input, c clustered input
Overall, we learned that the chunked methods are completely resilient to both skewed queries and skewed input. However, in their current state, they have severe problems handling clustered input. P-RPRS suffers from skewed input because its range-partitioning phase creates equi-width partitions that do not balance the sorting work. P-SC reacts negatively to both skewed input and skewed queries due to the higher contention.
6 Lessons learned and conclusion
- 1.
Database cracking is a mature field of research. Database cracking is a simple yet effective technique for adaptive indexing. In contrast to full indexing, database cracking is lightweight, i.e. it does not penalize the first query heavily. Rather, it incrementally performs at most one quicksort step per query and nicely distributes the indexing effort over several queries. Moreover, database cracking indexes only those data regions that are actually touched by incoming queries. As a result, database cracking fits perfectly with the modern needs of adaptive data management. Furthermore, apart from the incremental index creation in standard cracking, several follow-up works have looked into further aspects of adaptive indexing. These include updating a cracked database, convergence of database cracking towards a full index, efficient tuple reconstruction, and robustness against unpredictable changes in the query workload. Thus, we can say that database cracking has come a long way and is a mature field of research.
- 2.
Database cracking is repeatable. In this paper, we repeated eight previous database cracking works, including standard cracking using crack-in-two and crack-in-three [17], predication cracking [23], hybrid cracking [16], sideways cracking [15], and stochastic cracking [11], as well as the whole line of parallel cracking works [2, 8, 9]. We reimplemented the cracking algorithms from each of these works and tested them under settings similar to those in the original works. Our results match the ones presented in the previous works very closely, and we can confirm their findings, i.e. hybrid cracking indeed improves convergence towards a full index, sideways cracking allows for more efficient tuple reconstruction, and stochastic cracking offers more predictable query performance than standard cracking. We can thus say that cracking is repeatable in ad hoc query engines other than MonetDB as well.
- 3.
Still a lot of potential to improve database cracking. There is still a lot of potential to do better in several aspects of database cracking, including faster convergence towards a full index, more efficient tuple reconstruction, and more robust query performance. For example, by buffering the elements to be swapped in a heap, we can reduce the number of swaps and thus converge faster. Similarly, by covering the cracked index, we can achieve better scalability in the number of projected attributes. Likewise, we can invest a small amount of initialization time to create a coarse-granular index, which improves query robustness. All of these are promising directions in the database cracking landscape. Thus, we believe that even though cracking has come a long way, it still has a lot further to go.
- 4.
Database cracking depends heavily on the query access pattern. As the presented techniques are adaptive due to their query-driven character, each of them is more or less sensitive to the applied query access pattern. A uniform random access pattern can be considered the best case for all methods, as it leads to uniform partition sizes across the data. In contrast, sequential patterns crack the index in small steps and force the algorithms to rescan large parts of the data. Skewed access patterns lead to a high variance in runtime, depending on whether or not the query predicate hits the hotspot area. Overall, stochastic cracking (MDD1R) and the coarse-granular index, which extend their query-driven character with data-driven influences, are less sensitive to the query access pattern than the methods that take only the seen queries into account.
- 5.
Workload selectivities affect the amount of indexing effort in database cracking. Since cracking reorganizes only the accessed portions of the data, the total indexing effort varies with the query selectivities. In fact, the total indexing effort in standard cracking drops by \(45\,\%\) when the selectivity changes from \(10^{-5}\) to \(10^{-1}\). Although high selectivity queries reorganize smaller portions of the data, the reorganization happens much more often before the final state is reached. Additionally, earlier cracking works suggested stopping data reorganization below a certain partition size in order to reduce the indexing effort. However, we saw that the overhead of the additional filtering eclipses the savings in indexing effort.
- 6.
Database cracking needs to catch up with modern indexing trends. We saw that, for sorting, radix sort is twice as fast as quick sort. After 600 queries, the accumulated query response time of binary search on radix-sorted data is already lower than that of standard cracking. This means that a full sort pays off over standard cracking in fewer than 1000 queries. Thus, we need to explore more lightweight techniques for database cracking to be competitive with radix sort. Furthermore, several recent works have proposed main memory optimized index structures. The recently proposed ART offers 1.8 times faster lookups than standard cracking after 1000 queries and 3.6 times faster lookups after \(1\,\hbox {M}\) queries. We note two things here: (1) the cracker index offers much slower lookups than modern main memory indexes, and (2) the cracker index gets even worse as the number of queries increases. Thus, we need to revisit the index structures used in database cracking and catch up with modern indexing trends.
- 7.
Database cracking needs to improve its mapping to parallel hardware. We inspected several parallel cracking algorithms that use either inter- or intra-query parallelism and compared them, both in terms of scaling with the available hardware resources and in terms of absolute runtimes, with sort-based approaches. We identified lock contention and the shared memory bus as the main limitations of parallel cracking algorithms. In terms of absolute query response times, the sorting methods are strong competitors to their cracking-based counterparts and offer nice additional properties like interesting orders; however, this holds only if a large number of threads is available. This picture is confirmed in the tuple reconstruction case, where parallel sideways cracking wins over parallel clustering only under limited computing resources. Skew affects the parallel algorithms to different degrees depending on its type: most algorithms actually prefer a higher skewness, although, for example, clustered input heavily throttles certain methods in their current realizations.
- 8.
Different indexing methods have different signatures. We looked at several indexing techniques in this paper. Let us now contrast the indexing behaviour of the different methods in a nutshell. To do so, we introduce a way to fingerprint indexing methods: we measure the progress of index creation over the progress of query processing, i.e. how different indexing methods index the data over time as the queries are being processed (Fig. 24). This measure essentially acts as a signature of an indexing method. The x-axis shows the normalized accumulated lookup and data access time (querying progress) and the y-axis shows the normalized accumulated data shuffle and index update time (indexing progress); a minimal computation sketch is given below. We can see that different indexing methods have different curves. For example, standard cracking gradually builds the index as the queries are processed, whereas the full index builds the entire index before processing any queries. Hybrid crack sort and hybrid sort have steeper curves than standard cracking, indicating that they build the index more quickly. On the other hand, stochastic cracking has a much smoother curve. Sideways and covered cracking perform large parts of their indexing effort already in the first query by copying table columns into the index to speed up tuple reconstruction. It is interesting to see that each method has a unique curve which characterizes its indexing behaviour. Furthermore, there is still a lot more room to design adaptive indexing algorithms with even more diverse indexing signatures.
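A minimal sketch of how such a signature can be computed from per-query timings is shown below; the struct and function names are our own and only illustrate the measure, not the exact instrumentation used for Fig. 24.

```cpp
#include <vector>

// Per-query timings: time spent on lookups and data accesses (querying)
// versus time spent on data shuffling and index updates (indexing).
struct QueryTimes { double lookup_and_access; double shuffle_and_update; };
struct Point { double querying_progress; double indexing_progress; };

// The signature is the curve of normalized running sums, one point per query.
std::vector<Point> signature(const std::vector<QueryTimes>& per_query) {
  double total_q = 0.0, total_i = 0.0;
  for (const auto& t : per_query) {
    total_q += t.lookup_and_access;
    total_i += t.shuffle_and_update;
  }
  std::vector<Point> curve;
  double acc_q = 0.0, acc_i = 0.0;
  for (const auto& t : per_query) {
    acc_q += t.lookup_and_access;
    acc_i += t.shuffle_and_update;
    curve.push_back({total_q > 0.0 ? acc_q / total_q : 0.0,    // both axes end at 1.0
                     total_i > 0.0 ? acc_i / total_i : 0.0});  // guard: method without that cost
  }
  return curve;
}
```

Under this measure, a full index jumps to an indexing progress of 1.0 within the first query, whereas standard cracking approaches it gradually, matching the curves described above.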
Signatures of indexing methods
Note that the query time of the full scan varies by as much as a factor of 4. This is caused by the lazy evaluation in the filtering, which depends on the position of the low key and the high key in the value domain.
After the first few queries, cracking mostly performs a pair of crack-in-two operations as the likelihood of two splits falling in two different partitions increases with the number of applied queries.
Please note that our current implementation relies on a uniform key distribution to create equal-sized partitions. Handling skewed distributions would require the generation of equi-depth partitions.
Acknowledgments
Special thanks to Stratos Idreos for helping us in understanding the hybrid methods. Work partially supported by BMBF.
Compliance with ethical standards
Competing interests
As we re-evaluate previously published research, there are potential competing interests with CWI Amsterdam and the authors of [8].