Optimising memory for Aggregates and Join operators in Apache Impala.
Analytical SQL workloads use aggregates and joins heavily. Hence, optimising such operators for both performance and efficiency in analytical engines like Apache Impala can be very beneficial to users. In this blog, we will look into one of the techniques used in Apache Impala to reduce peak memory usage of Aggregates and Joins by up to 50% and peak node memory usage by 18% at node level on TPC-DS 10000 workload.
Hash Table
Both Aggregates and Joins in Impala uses Hash Table and we will show how we reduced its size for the operation. HashTable class implementation in Impala comprises a contiguous array of Bucket and each Bucket contains either pointer to data or a pointer to a linked list of duplicate entries named DuplicateNode.
These are the structures of Bucket and DuplicateNode (few details changed for simplicity):
When evaluating size for struct these are some of the rules for memory alignment assuming 64-bit system:
- Memory address for individual members starts at memory address divisible by its size i.e., pointer will start at memory divisible by 8, bool at 1 and uint32_t by 4. Members will be preceded by padding if needed to make sure the starting address is divisible by their size.
- Size of struct will be aligned to the it’s largest member. For instance, in both the struct above the largest member is a pointer of size 8 bytes. Hence, the size of struct will be multiple of 8 too.
Based on the above rules, Bucket in above snippet is commented with size occupied by every member and padding wherever required. Total size of the Bucket is 16 bytes. Similarly, the total size of DuplicateNode is 24 bytes.
We decided to reduce the size of structs by removing bool from both struct Bucket and DuplicateNode to reduce size to 12 bytes and 16 bytes respectively. However 12 bytes is not a valid size of Bucket as it needs to be multiple of 8 bytes (size of largest member of struct). In such cases, we can use __attribute__ ((packed)) to ensure struct packing so that size is 12 bytes.
How do we achieve removing these booleans as they need to be present for every Bucket or DuplicateNode ?
tl;dr: We decided to remove all bool members by folding it into a pointer already part of the struct.
Folding data into pointers
Intel Level 5 proposal 64-bit memory address
On 64-bit architecture pointers store memory addresses using 8 bytes. But on architectures like x86 and ARM the linear address is limited to 48 bits long with bits 49 to 64 reserved for future usage. In future with Intel’s level 5 paging proposal (whitepaper), it is planning to relax the limit to 57-bit on x86, which means we can use most significant 7 bits i.e., 58 to 64 bits to store extra data. One caveat is, even if just 48 bits out of 64 bits are needed to read a memory, the processor checks if significant bits (48…64) are identical i.e., sign extended. If not, such an address will cause a fault. It means folded pointers may not always be storing a valid addressable memory. Hence folded pointers need to be sign extended before dereferencing.
We use the above technique to fold filled, matched and hasDuplicates into pointer bucketData. After folding and struct packing we will get a Bucket size of 12 bytes. Similarly DuplicateNode can be reduced to 16 bytes instead of 24 bytes.
Other requirements
In our implementation, there is a requirement regarding size of Bucket and number of buckets in hash table to be power of 2. These requirements are for the following reasons:
- Internal memory allocator allocates memory in power of 2 to avoid internal fragmentation. Hence, number of buckets * sizeof(Bucket) should be power of 2.
- Number of buckets (‘N’) being the power of 2 enables faster modulo operation.
Instead of slow modulo operation (hash % N), faster bitwise operation (hash & (N-1)) can be used when N is power of 2.
Due to this, a 4 bytes hash field from Bucket is removed and stored separately in a new array hash_array_ in HashTable class. This ensures sizeof(Bucket) is 8 which is power of 2. Another advantage of separating hash is that Bucket is not required to be packed now.
Experimental evaluation:
We did extensive evaluation of the technique to check how it affects the performance and memory. We used 3 benchmarks for the same:
- Microbenchmark: We ran the build and probe methods multiple times on a smaller number of rows around 60 times to evaluate the performance and memory consumed.
- Billion-Row benchmark: On a single daemon, we ran build and probe benchmark for a billion rows to measure the performance and memory consumed.
- TPC-DS-10000: Entire TPC-DS benchmark of scale 10000 was run on a 17-node cluster to measure the performance. It also measured peak memory consumed at node and operator level.
Microbenchmark
Figure 2a shows the results of the memory benchmark. Benchmark names are in format memory_XX_YY where XX is the number of values being inserted into the hash table whereas YY represents the percentage of unique values. We see a reduction in memory consumed by upto 30% on building the hash table.
Figure 2b shows the results of the performance benchmark. build_XX_YY represents the build benchmark where XX values were inserted and YY is the percentage of unique values. Similarly probe_XX_YY would probe against a hash table built with XX rows and YY unique values. These benchmarks were run 60 times and each time around 10 times to find out iterations per milliseconds. Figure 2b shows the 90th percentile of the number of iterations measured for those 60 runs. We observe no significant difference in the runtime due to change.
Billion-Row benchmark
We used a sales and items table for this benchmark. sales had columns s_item_id (int), s_quantity(int) ,s_date(date), whereas items had columns i_item_id (int)and i_price (double). sales had 1 Billion rows and items had 30 million rows.
Build Benchmark
We ran a group by query on sales to measure the performance and memory of building a hash table.
Query: select count(*) from sales group by s_item_id having count(*) > 9999999999;
As shown in Figure 3a, we saw peak allocation reducing by 17% and cumulative allocation reducing by 21%. On running 20 times, we didn’t see any performance degradation. Geomean with changes and without changes were around 68 seconds in both cases.
Probe Benchmark
For measuring the probe we ran a join query between items and sales where sales is on the probe side and items is on the build side. Since we are building a hash table only on a smaller table in the join proposed, the goal of this benchmark was not to measure the reduction in memory but to measure any performance difference in probing 1 Billion rows via sales table.
However, we created 3 kinds of sales table for this purpose:
- sales_base: It has randomly generated 1 billion rows same as what was used in Build Benchmarks
- sales_30: It has 1 billion rows with 30% of the rows unique.
- sales_60: It has 1 Billion rows with 60% of the rows unique.
We saw similar performance in both runs with our changes being slightly faster on ‘sales_base’ as shown in Figure 3b.
TPCDS-10000 scale
We implemented the above technique and evaluated it against a TPC-DS workload of scale 10000 in Parquet format. We ran all the workload queries on a 17 node cluster with data being in HDFS.
Per-Operator Reduction:
For every query we computed the maximum percentage of reduction in memory for individual Join and Aggregation operators. We considered only the operators greater than 10 MB. We found 42 out of 99 queries memory consumption reduced by more than 10% and for 24 queries it was found memory consumption was reduced more than 20% as shown in Figure 4a.
Per-Node memory reduction:
On computing average peak memory consumption for nodes involved, 28 queries showed greater than 5% reduction in memory and 11 queries showed more than 10% reduction as shown in Figure 4b. We saw around 18% reduction at max for q72.
On considering max-peak memory consumed in any node for a query, more than 27 queries show reduction by 5% and more than 11 query show reduction more than 10% as shown in Figure 4c. Maximum reduction observed was more than 20% for q65.
Conclusion
As shown in the previous section we saw significant reduction in memory both at the node level and operator level without any performance degradation. I would like to encourage folks to try Apache Impala out and reach out to the community via user@impala.apache.org or dev@impala.apache.org to share your feedback.