Automatic Statistics Collection For Better Query Performance | Qubole

Amogh Margoor, Goden Yao

Presto, Apache Spark, and Apache Hive can generate more efficient query plans with table statistics. For example, Spark, as of version 2.1.1, will perform broadcast joins only if the table size is available in the table statistics stored in the Hive Metastore (see spark.sql.autoBroadcastJoinThreshold). Broadcast joins can have a dramatic impact on the run time of everyday SQL queries where small dimension tables are joined frequently. The Big Bench tuning exercise from Intel reported a 4.22x speedup by turning on broadcast joins for specific queries.

In QDS, all the query engines use Hive Metastore as the catalog. If the Hive Metastore contains statistics, then all query engines can use them for query planning as exemplified above. But table statistics collection is not automatic. One of the goals of the Qubole platform is to apply automation to help users achieve maximum performance and efficiency. This blog post describes how we have automated Table Statistics collection in Qubole Data Service (QDS).

Table Statistics in Hive

1

User sets hive.stats.autogather=true to gather statistics automatically during INSERT OVERWRITE queries.

  • Not supported for SparkSQL
  • Increased query run time.
  • Doesn’t catch DDL/DML Statements like:
  • CREATE EXTERNAL TABLE
  • ALTER TABLE/PARTITION
  • INSERT INTO
  • Doesn’t catch direct writes to S3 locations for external tables

2

User schedules ANALYZE TABLE COMPUTE STATISTICS

User picks up a few tables they want to keep stats updated and then uses Qubole Scheduler (or an external cron job) to run the COMPUTE STATISTICS statement on these tables periodically.

  • Manual effort on the User’s side
  • Does not cover all tables automatically
  • Unnecessary runs might occur even if the tables are not updated/modified

As we can see, both of the available approaches have major gaps. This prompted us to build statistics collection into the QDS platform as an automated service.

Statistics Collection Service

  • Use ANALYZE COMPUTE STATISTICS statement in Apache Hive to collect statistics.
  • ANALYZE statements should be triggered for DML and DDL statements that create tables or insert data on any query engine.
  • ANALYZE statements should be transparent and not affect the performance of DML statements.

ANALYZE .. COMPUTE STATISTICS comes in three flavors in Apache Hive. The three flavors are described in the table below.

1

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], …)]

COMPUTE STATISTICS

NO SCAN

  • Number of files
  • Physical size in bytes

Very fast, nearly zero impact on the cluster running workloads

2

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], …)]

COMPUTE STATISTICS

(in addition to above)

Will scan entire table or partition specified, moderate impact

3

ANALYZE TABLE [db_name.]tablename

COMPUTE STATISTICS FOR COLUMNS

  • number of distinct values
  • number of NULL values
  • number of TRUE and FALSE (in case of booleans)
  • min/max values
  • average/max length of the column

Significant impact

Each flavor requires different compute resources. Another requirement for this feature is then:

In the first version, users can choose the flavor at the account level. In the future, we will let users configure the flavor at a table or schema level.

Architecture Diagram

  1. A user issues a Hive or Spark command.
  2. If this command is a DML or DDL statement, the metastore is updated.
  3. A custom MetastoreEventListener is triggered.
  4. The triggers call back to the QDS Control plane and launch an ANALYZE command for the target table of the DML statement. Internally, the ANALYZE query will be executed like any other Hive command on the cluster on which the user issued the original command.

Throttling & Batching

  • Global limit: Controls concurrency across QDS control plane
  • Account limit: Controls concurrency within each QDS account
  • Table limit: Controls concurrency for each table

Limits at account and table level are not exposed to administrators right now — but will eventually be.

If the service receives multiple triggers for the same table/partition/column, the triggers will be consolidated into one ANALYZE command. For example, if a DML statement adds many partitions, multiple requests are received by the QDS control plane within a very short time span. These requests will be batched up and a single command will compute statistics for all the new partitions.

Status and Future work

Qubole recently announced AIR (Alerts, Insights, Recommendations) in Data Platforms 2017. Automatic Statistics Collection will play a key part in our autonomous data platform. Some examples of use cases within AIR are:

  • Users will be able to see table statistics in the preview.
  • The QDS platform can give recommendations and insights on better partitioning or data organization strategies to data engineers, architects, or admins based on statistics.

References

  1. Apache Hive Statistics — Apache Hive Wiki

Originally published at https://www.qubole.com on June 19, 2017.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store