SQL Join Optimizations in Qubole Presto

  • Dynamic Filtering
  • Dynamic Filtering provides 2.8X geomean improvement and 14X maximum improvement.
  • Due to better resource utilization from these optimizations, Qubole Presto can now run 90 TPC-DS queries compared to 72 without these optimizations.
  • Data was stored in HDFS instead of S3
  • No proprietary Qubole features like Qubole Rubix, autoscaling or spot node support were used.
  • Timings published are average of 3 runs.
  • The other table is used to probe the hash table and find keys that match. This table is called the probe side and is typically notated on the left side.
  1. Below figure shows 21 queries that saw more than 1.25x improvement on enabling join reordering:
  1. Predicate Pushdown: When data is stored in columnar formats like ORC/Parquet, pushing predicates like A.date_key IN d” will reduce disk I/O.
  2. Filtering just after Table Scan: Filtering just after Table Scan can avoid sending more data to later operators and save on Network I/O and memory.
  • Runtime of 13 queries improved between 3X — 5X.
  • Runtime of 22 queries improved between 1.5X — 3X.
  • 14 queries that did not run before succeeded.
  • Queries that failed earlier and passing now: There were 14 queries which failed earlier without Dynamic Filtering and passes when Dynamic Filter is enabled. Out of 104 queries without Dynamic Filtering we could run on only 72 queries and with Dynamic Filter we could run on 86 queries. Queries were failing earlier due to OOM errors. Now with Dynamic filtering lesser data is sent to operators like Join causing lesser memory consumption and hence more queries passing than earlier.

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store