Optimising memory for Aggregate and Join operators in Apache Impala

Analytical SQL workloads use aggregates and joins heavily. Hence, optimising such operators for both performance and efficiency in analytical engines like Apache Impala can be very beneficial to users. In this blog, we will look into one of the techniques used in Apache Impala to reduce the peak memory usage of Aggregates and Joins by up to 50% at the operator level and by up to 18% at the node level on the TPC-DS 10000 workload.

Hash Table

These are the structures of Bucket and DuplicateNode (a few details changed for simplicity):
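The sketch below is consistent with the sizes discussed in this section; member names follow the text, and the data members are simplified to plain pointers rather than the exact Impala types.

    #include <cstdint>

    // Sketch only; not the exact Impala definitions.
    struct Bucket {
      bool filled;          // 1 byte
      bool matched;         // 1 byte
      bool hasDuplicates;   // 1 byte
                            // 1 byte of padding
      uint32_t hash;        // 4 bytes
      void* bucketData;     // 8 bytes: points to row data or to a chain of DuplicateNodes
    };                      // total: 16 bytes

    struct DuplicateNode {
      bool matched;         // 1 byte
                            // 7 bytes of padding
      DuplicateNode* next;  // 8 bytes
      void* htdata;         // 8 bytes: row data, simplified here to a plain pointer
    };                      // total: 24 bytes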

When evaluating the size of a struct, these are some of the rules for memory alignment, assuming a 64-bit system:

  1. The memory address of an individual member starts at an address divisible by its size, i.e., a pointer will start at an address divisible by 8, a bool by 1 and a uint32_t by 4. Members are preceded by padding where needed to make sure their starting address is divisible by their size.
  2. The size of the struct is aligned to its largest member. For instance, in both structs above the largest member is a pointer of 8 bytes; hence the size of the struct will be a multiple of 8 too.

Based on the above rules, Bucket in the snippet above is commented with the size occupied by every member and with padding wherever required. The total size of Bucket is 16 bytes. Similarly, the total size of DuplicateNode is 24 bytes.

We decided to reduce the size of the structs by removing the bool members from both Bucket and DuplicateNode, which would shrink them to 12 bytes and 16 bytes respectively. However, 12 bytes is not a valid size for Bucket, as it needs to be a multiple of 8 bytes (the size of the largest member of the struct). In such cases, we can use __attribute__ ((packed)) to enforce struct packing so that the size is 12 bytes.
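As an illustration, here is a sketch of the packing effect on a bool-free Bucket, assuming a 64-bit build (BucketData stands in for the pointer-sized payload; not the actual Impala definitions):

    #include <cstdint>

    using BucketData = void*;  // stands in for the pointer-sized payload

    struct UnpackedBucket {
      uint32_t hash;          // 4 bytes
                              // 4 bytes of padding so the pointer is 8-byte aligned
      BucketData bucketData;  // 8 bytes
    };                        // rounded up to 16 bytes

    struct __attribute__((packed)) PackedBucket {
      uint32_t hash;          // 4 bytes
      BucketData bucketData;  // 8 bytes, no longer 8-byte aligned
    };                        // 12 bytes, no padding once packed

    static_assert(sizeof(UnpackedBucket) == 16, "padded and rounded to a multiple of 8");
    static_assert(sizeof(PackedBucket) == 12, "packed");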

How do we remove these booleans, given that they need to be tracked for every Bucket and DuplicateNode?

tl;dr: We decided to remove all the bool members by folding them into a pointer that is already part of the struct.

Folding data into pointers

Figure 1. Level 5 64-bit memory address

On 64-bit architectures, pointers store memory addresses using 8 bytes. But on architectures like x86 and ARM, the linear address is limited to 48 bits, with bits 49 to 64 reserved for future usage. With Intel's 5-level paging proposal (whitepaper), the limit will be relaxed to 57 bits on x86, which means we can still use the most significant 7 bits, i.e., bits 58 to 64, to store extra data. One caveat: even though only 48 of the 64 bits are needed to address memory, the processor checks that the most significant bits (48…64) are identical, i.e., the address is sign extended. If not, such an address will cause a fault. This means a folded pointer may not always be a valid addressable memory address, so folded pointers need to be sign extended before dereferencing.
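A minimal sketch of the idea looks like the following; the helper names are illustrative, not Impala's actual API, and the example assumes a typical user-space address whose upper bits are zero:

    #include <cassert>
    #include <cstdint>

    constexpr uint64_t kFlagBit = 1ULL << 63;

    // Fold the flag into the top bit; the 48-bit linear address is untouched.
    inline uint64_t Fold(void* ptr, bool flag) {
      return reinterpret_cast<uint64_t>(ptr) | (flag ? kFlagBit : 0);
    }

    inline bool GetFlag(uint64_t folded) { return (folded & kFlagBit) != 0; }

    // Restore a canonical pointer before dereferencing: shift the address left
    // so bit 47 becomes the sign bit, then arithmetic-shift right to sign-extend it.
    inline void* GetPtr(uint64_t folded) {
      int64_t canonical = static_cast<int64_t>(folded << 16) >> 16;
      return reinterpret_cast<void*>(canonical);
    }

    int main() {
      int value = 42;
      uint64_t folded = Fold(&value, /*flag=*/true);
      assert(GetFlag(folded));
      assert(*static_cast<int*>(GetPtr(folded)) == 42);
      return 0;
    }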

We use the above technique to fold filled, matched and hasDuplicates into the pointer bucketData. After folding and struct packing, Bucket comes down to 12 bytes. Similarly, DuplicateNode can be reduced to 16 bytes instead of 24 bytes.
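For DuplicateNode, the remaining bool can be folded the same way; a sketch (with the flag shown on the next pointer purely for illustration):

    #include <cstdint>

    struct DuplicateNode {
      uint64_t next;   // 8 bytes: tagged DuplicateNode*, an upper bit holds 'matched'
      void* htdata;    // 8 bytes: row data, simplified
    };                 // total: 16 bytes, down from 24

    static_assert(sizeof(DuplicateNode) == 16, "no bool, no padding");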

Other requirements

  1. The internal memory allocator allocates memory in powers of 2 to avoid internal fragmentation. Hence, number of buckets * sizeof(Bucket) should be a power of 2.
  2. The number of buckets ('N') being a power of 2 enables a faster modulo operation.

Instead of the slow modulo operation (hash % N), the faster bitwise operation (hash & (N-1)) can be used when N is a power of 2.
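For example (a toy check, not Impala code):

    #include <cassert>
    #include <cstdint>

    int main() {
      const uint32_t N = 1024;            // number of buckets, a power of 2
      const uint32_t hash = 0x9e3779b9u;  // arbitrary hash value
      // When N is a power of 2, N - 1 is an all-ones mask of the low bits,
      // so the bitwise AND gives the same result as the modulo.
      assert((hash & (N - 1)) == (hash % N));
      return 0;
    }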

Due to this, the 4-byte hash field is removed from Bucket and stored separately in a new array, hash_array_, in the HashTable class. This ensures that sizeof(Bucket) is 8, which is a power of 2. Another advantage of separating the hash is that Bucket no longer needs to be packed.
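A sketch of the resulting layout, with illustrative names and std::vector standing in for Impala's actual buffer-pool allocations:

    #include <cstdint>
    #include <vector>

    struct Bucket {
      uint64_t bucketData;  // tagged pointer; filled/matched/hasDuplicates in the upper bits
    };
    static_assert(sizeof(Bucket) == 8, "8 is a power of 2, so num_buckets * 8 stays a power of 2");

    class HashTable {
      std::vector<Bucket> buckets_;
      std::vector<uint32_t> hash_array_;  // hash_array_[i] holds the hash for buckets_[i]
    };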

Experimental evaluation:

  1. Microbenchmark: We ran the build and probe methods around 60 times on a smaller number of rows to evaluate the performance and memory consumed.
  2. Billion-Row benchmark: On a single daemon, we ran the build and probe benchmarks for a billion rows to measure the performance and memory consumed.
  3. TPC-DS-10000: The entire TPC-DS benchmark at scale 10000 was run on a 17-node cluster to measure the performance. It also measured peak memory consumed at the node and operator level.

Microbenchmark

Figure 2a. Memory Benchmark
Figure 2b. Runtime Benchmark

Figure 2a shows the results of the memory benchmark. Benchmark names are in the format memory_XX_YY, where XX is the number of values being inserted into the hash table and YY is the percentage of unique values. We see a reduction of up to 30% in the memory consumed when building the hash table.

Figure 2b shows the results of the performance benchmark. build_XX_YY represents the build benchmark, where XX values were inserted and YY is the percentage of unique values. Similarly, probe_XX_YY probes against a hash table built with XX rows and YY percent unique values. These benchmarks were run 60 times, with around 10 iterations each time, to measure iterations per millisecond. Figure 2b shows the 90th percentile of the number of iterations measured across those 60 runs. We observe no significant difference in runtime due to the change.

Billion-Row benchmark

Build Benchmark

Query: select count(*) from sales group by s_item_id having count(*) > 9999999999;

Figure 3a. Memory usage in GroupBy query over 1 billion rows

As shown in Figure 3a, we saw peak allocation reducing by 17% and cumulative allocation reducing by 21%. On running the query 20 times, we didn't see any performance degradation: the geometric mean of the runtime was around 68 seconds both with and without the changes.

Probe Benchmark

We created 3 kinds of sales tables for the probe benchmark:

  1. sales_base: It has 1 billion randomly generated rows, the same as what was used in the build benchmark.
  2. sales_30: It has 1 billion rows with 30% of the rows unique.
  3. sales_60: It has 1 billion rows with 60% of the rows unique.

We saw similar performance in both runs, with our changes being slightly faster on sales_base, as shown in Figure 3b.

Figure 3b. Probe times with 1 Billion rows

TPC-DS 10000 scale

Figure 4a. TPC-DS queries with memory reduction at operator level

Per-operator memory reduction:

Per-node memory reduction:

Figure 4b. Average peak memory reduction across all nodes
Figure 4c. Reduction in max peak memory across all nodes

Considering the max peak memory consumed in any node for a query, more than 27 queries show a reduction of more than 5% and more than 11 queries show a reduction of more than 10%, as shown in Figure 4c. The maximum reduction observed was more than 20%, for q65.

Conclusion

Folding the boolean flags into existing pointers and moving the hash out of Bucket shrank Bucket from 16 bytes to 8 bytes and DuplicateNode from 24 bytes to 16 bytes. Across the microbenchmarks, the Billion-Row benchmark and TPC-DS 10000, this reduced the peak memory usage of Aggregates and Joins by up to 50% at the operator level and by up to 18% at the node level, with no significant change in runtime.
