SQL Join Optimizations in Qubole Presto

  • Join Reordering
  • Dynamic Filtering
  • Join Reordering provides a maximum improvement of 6X.
  • Dynamic Filtering provides 2.8X geomean improvement and 14X maximum improvement.
  • Due to better resource utilization from these optimizations, Qubole Presto can now run 90 TPC-DS queries compared to 72 without these optimizations.
  • Default Presto configuration was used. For example distributed joins are used (default) instead of broadcast joins.
  • Data was stored in HDFS instead of S3
  • No proprietary Qubole features like Qubole Rubix, autoscaling or spot node support were used.
  • Timings published are average of 3 runs.
  • One of the tables is used to build a hash table. This table is called the build side and typically notated on the right side.
  • The other table is used to probe the hash table and find keys that match. This table is called the probe side and is typically notated on the left side.
  1. 17 queries which were failing earlier with Out of Memory errors without join reordering pass when join reordering is enabled. This shows that not only performance but join reordering can help to reduce memory consumption of the queries.
  2. Below figure shows 21 queries that saw more than 1.25x improvement on enabling join reordering:
  1. Partition pruning: In our example if we assumed that A.date_key was a partition column, then dynamic filter A.date_key IN d” could have been used to prune partitions and save on I/O.
  2. Predicate Pushdown: When data is stored in columnar formats like ORC/Parquet, pushing predicates like A.date_key IN d” will reduce disk I/O.
  3. Filtering just after Table Scan: Filtering just after Table Scan can avoid sending more data to later operators and save on Network I/O and memory.
  • Runtime of 13 queries improved by at least 5X.
  • Runtime of 13 queries improved between 3X — 5X.
  • Runtime of 22 queries improved between 1.5X — 3X.
  • 14 queries that did not run before succeeded.
  • Speedup less than 1.5x: Around 23 queries showed speedups lesser than 1.5 times and had insignificant effect due to dynamic filters.
  • Queries that failed earlier and passing now: There were 14 queries which failed earlier without Dynamic Filtering and passes when Dynamic Filter is enabled. Out of 104 queries without Dynamic Filtering we could run on only 72 queries and with Dynamic Filter we could run on 86 queries. Queries were failing earlier due to OOM errors. Now with Dynamic filtering lesser data is sent to operators like Join causing lesser memory consumption and hence more queries passing than earlier.

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store