Optimising memory for Aggregates and Join operators in Apache Impala.

Hash Table

Both aggregates and joins in Impala use a hash table, and we will show how we reduced its size for these operations. The HashTable class in Impala is implemented as a contiguous array of Bucket objects, where each Bucket contains either a pointer to data or a pointer to a linked list of duplicate entries called DuplicateNode.

  1. Each member starts at a memory address divisible by its size: a pointer starts at an address divisible by 8, a bool by 1, and a uint32_t by 4. Padding is inserted before a member where needed to make sure its starting address is divisible by its size.
  2. The size of a struct is aligned to that of its largest member. For instance, in both Bucket and DuplicateNode the largest member is an 8-byte pointer, so the size of each struct is a multiple of 8.

Folding data into pointers

Figure 1. 64-bit memory address layout under Intel's 5-level paging (Level 5) proposal

Other requirements

In our implementation, both the size of a Bucket and the number of buckets in the hash table are required to be powers of 2, for the following reasons:

  1. The internal memory allocator allocates memory in powers of 2 to avoid internal fragmentation. Hence, number of buckets * sizeof(Bucket) should be a power of 2.
  2. A power-of-2 number of buckets ('N') enables a faster modulo operation.

Experimental evaluation:

We evaluated the technique extensively to check how it affects performance and memory, using 3 benchmarks:

  1. Microbenchmark: We ran the build and probe methods around 60 times on a small number of rows to evaluate performance and memory consumption.
  2. Billion-Row benchmark: On a single daemon, we ran build and probe benchmark for a billion rows to measure the performance and memory consumed.
  3. TPC-DS-10000: The entire TPC-DS benchmark at scale 10000 was run on a 17-node cluster to measure performance. We also measured peak memory consumption at the node and operator level.

Microbenchmark

Figure 2a. Memory Benchmark
Figure 2b. Runtime Benchmark

Billion-Row benchmark

We used a sales table and an items table for this benchmark. sales had columns s_item_id (int), s_quantity (int), and s_date (date), whereas items had columns i_item_id (int) and i_price (double). sales had 1 billion rows and items had 30 million rows.

Build Benchmark

We ran a GROUP BY query on sales to measure the performance and memory of building a hash table.

Figure 3a. Memory usage in GroupBy query over 1 billion rows

Probe Benchmark

To measure probe performance we ran a join query between items and sales, with sales on the probe side and items on the build side. Since the hash table is built only on the smaller table of the join, the goal of this benchmark was not to measure memory reduction but to detect any performance difference when probing 1 billion rows from the sales table. We used three variants of sales:

  1. sales_base: 1 billion randomly generated rows, the same as used in the Build benchmark.
  2. sales_30: 1 billion rows with 30% of the rows unique.
  3. sales_60: 1 billion rows with 60% of the rows unique.
Figure 3b. Probe times with 1 Billion rows

TPC-DS-10000 scale

We implemented the above technique and evaluated it against a TPC-DS workload at scale 10000 in Parquet format. We ran all the workload queries on a 17-node cluster with the data stored in HDFS.

Figure 4a. TPC-DS queries with memory reduction at operator level

Per-Operator Reduction:

For every query we computed the maximum percentage reduction in memory for individual Join and Aggregation operators, considering only operators consuming more than 10 MB. For 42 out of 99 queries, memory consumption was reduced by more than 10%, and for 24 queries by more than 20%, as shown in Figure 4a.

Per-Node memory reduction:

Computing the average peak memory consumption across the nodes involved, 28 queries showed a greater than 5% reduction and 11 queries showed a greater than 10% reduction, as shown in Figure 4b. The largest reduction was around 18%, for q72.

Figure 4b. Average peak memory reduction across all nodes
Figure 4c. Reduction in max peak memory across all nodes

Conclusion

As shown in the previous sections, we saw a significant reduction in memory at both the node level and the operator level, without any performance degradation. I would encourage folks to try Apache Impala and to reach out to the community via user@impala.apache.org or dev@impala.apache.org to share your feedback.
