Automatic Statistics Collection For Better Query Performance | Qubole

Table Statistics in Hive

The Apache Hive Statistics wiki page contains a good background on the list of statistics that can be computed and stored in the Hive metastore. Since statistics collection is not automated, we considered the current solutions available to users to capture table statistics on an ongoing basis. These are described below:

  • Not supported for SparkSQL
  • Increased query run time.
  • Doesn’t catch DDL/DML Statements like:
  • CREATE EXTERNAL TABLE
  • ALTER TABLE/PARTITION
  • INSERT INTO
  • Doesn’t catch direct writes to S3 locations for external tables
  • Manual effort on the User’s side
  • Does not cover all tables automatically
  • Unnecessary runs might occur even if the tables are not updated/modified

Statistics Collection Service

The requirements for the collection service are:

  • Use ANALYZE COMPUTE STATISTICS statement in Apache Hive to collect statistics.
  • ANALYZE statements should be triggered for DML and DDL statements that create tables or insert data on any query engine.
  • ANALYZE statements should be transparent and not affect the performance of DML statements.
  • Number of files
  • Physical size in bytes
  • number of distinct values
  • number of NULL values
  • number of TRUE and FALSE (in case of booleans)
  • min/max values
  • average/max length of the column

Architecture Diagram

The diagram below shows how to ANALYZE COMPUTE STATISTICS statements are triggered in QDS (In Hive Tier case):

  1. A user issues a Hive or Spark command.
  2. If this command is a DML or DDL statement, the metastore is updated.
  3. A custom MetastoreEventListener is triggered.
  4. The triggers call back to the QDS Control plane and launch an ANALYZE command for the target table of the DML statement. Internally, the ANALYZE query will be executed like any other Hive command on the cluster on which the user issued the original command.

Throttling & Batching

As we have described before — running Analyze commands takes compute bandwidth and has cost and performance implications (for other applications). There are limits on the number of concurrent ANALYZE COMPUTE STATISTICS commands. There are three tiers:

  • Global limit: Controls concurrency across QDS control plane
  • Account limit: Controls concurrency within each QDS account
  • Table limit: Controls concurrency for each table

Status and Future work

Automatic Statistics Collection is now available on the QDS platform for Beta usage. Please contact Qubole Support at [email protected] to try this out. We have seen a significant performance boost for analytic queries in Apache Hive and SparkSQL using table statistics and we are working with the Presto community to incorporate statistics in Presto query planning as well.

  • Users will be able to see table statistics in the preview.
  • The QDS platform can give recommendations and insights on better partitioning or data organization strategies to data engineers, architects, or admins based on statistics.

References

  1. Tuning and Optimizing an end to end benchmark, Yi Zhou, Intel SSG/STO/Big Data Technology
  2. Apache Hive Statistics — Apache Hive Wiki

--

--

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store