Insights and Recommendations for Hive, Presto and Spark SQL Engine Configuration | Qubole

  • The optimization function should be a function of dollar cost and throughput. Admins typically focus on one of them leading to sub-optimal configurations.
  • In general, execution engine configuration in the field is sub-optimal. In every experiment, a large fraction of the queries could be optimized by > 60%. This points to the fact that manual efforts to choose optimal configurations fall short in most cases.
  • Optimal configuration can be determined by running a workload iteratively with different values but the methodology is too expensive and impractical.
  • A simple model of the execution engine provides very good recommendations for SQL workloads. The model eliminates the need for actual execution.
  • The model is generic and is applicable to all of M/R, Tez, Presto and Spark engines.
  • The model can be used to automate and provide insights and recommendations for engine configuration.
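The first bullet's trade-off can be expressed as a single objective to minimize. A minimal sketch in Python, assuming cost and throughput are already normalized to comparable scales; the function name and weighting scheme are illustrative, not Qubole's actual objective:

```python
def objective(dollar_cost, throughput, cost_weight=0.5):
    """Blend dollar cost and throughput into one score to minimize.

    Hypothetical example: assumes both inputs are normalized; the
    linear weighting is an assumption, not from the post.
    """
    # Lower cost is better; higher throughput is better, so invert it.
    return cost_weight * dollar_cost + (1 - cost_weight) / throughput

# Optimizing cost alone (cost_weight=1.0) ranks configurations
# differently from a blended objective, which is the sub-optimality
# that admins focusing on a single metric run into.
blended = objective(dollar_cost=5.0, throughput=50.0)
cost_only = objective(dollar_cost=5.0, throughput=50.0, cost_weight=1.0)
```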

Related Work

Existing approaches to search for optimal configuration can be broadly classified into two types:

  1. Iterative Execution: In this approach, jobs are executed multiple times with different configuration parameters to arrive at the optimal configuration. As the parameter space is huge, these approaches focus on techniques to converge towards a good configuration using a lesser number of steps. For example, Sandeep et al [3] use a gradient named `noisy gradient` to converge to a solution and apply stochastic optimization to parameter tuning.
  2. Mathematical Model: In this approach, a mathematical model is used to predict the runtime/cost of jobs for a particular configuration. The search over the parameter space to find the optimal configuration can then be performed using the model, without having to actually run the job. Examples of this approach are Starfish[1] and BigExplorer[2].

Qubole study for SQL workloads

The above methods optimize configuration from the perspective of the engine. They do not consider the type of workload — SQL or programmatic (M/R or Scala code). The major advantage is that the methods are generally applicable. The major disadvantage is that the number of parameters is huge: the engine configuration page lists more than 100 parameters, which makes searching the parameter space or building a model hard. SQL workloads are easier to model because there is a finite set of operators, and only a small set of parameters is important, as described in Model-Based Execution below.

Optimization methodology

We explored two options to search the space of configuration values: iterative execution and model-based execution.

Iterative Execution

In this method, we ran Hive queries with various configuration parameters and chose the best among them. We employed the following strategies to reduce the parameter space:

  1. Parameter reduction: As described above, we focused on a small set of configuration parameters.
  2. Discretization: We further discretized each parameter so that we try a few values rather than all possible values for each parameter.
  3. Reduce search range: For each parameter, a large range of values can be significantly worse. Using heuristics, and ranges identified by talking to experts, we limited the search to a good range for each dimension.
  4. Assume dimension independence: To prevent parameter-space explosion due to correlation, we ignored the dependence of parameters on each other.
[Figure: the search space for two parameters. Discrete values are explored on both axes, the search space is restricted based on domain knowledge, and the algorithm iterates through each parameter to choose the optimal point before moving to the next parameter.]
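The four strategies amount to a greedy per-parameter search over a discretized, range-restricted grid. A runnable sketch, where `run_query(config) -> cost` stands in for actually executing the query, and all parameter names and values are illustrative:

```python
def coordinate_search(run_query, param_ranges, start):
    """Greedy per-parameter search: for each parameter in turn, try its
    discretized values (restricted to a heuristic range), keep the best,
    then move to the next parameter. Assumes parameters are independent,
    as in the dimension-independence strategy above."""
    best = dict(start)
    best_cost = run_query(best)
    for param, values in param_ranges.items():
        for v in values:
            trial = {**best, param: v}
            cost = run_query(trial)
            if cost < best_cost:
                best, best_cost = trial, cost
    return best, best_cost

# Toy cost surface with its optimum at mapper_mb=2048, reducers=100
# (purely illustrative; a real run would execute the query on a cluster).
def fake_run(cfg):
    return (abs(cfg["mapper_mb"] - 2048) / 1024
            + abs(cfg["reducers"] - 100) / 50)

ranges = {"mapper_mb": [1024, 2048, 4096], "reducers": [50, 100, 200]}
best, cost = coordinate_search(fake_run, ranges,
                               {"mapper_mb": 1024, "reducers": 50})
```

Each parameter is settled before the next is examined, so the number of runs grows linearly with the number of parameters rather than combinatorially.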

Experimental Results:

We used the algorithm to optimize 3 customer Hive queries and observed significant percentage reductions over settings chosen by the database admins. However, two issues make the approach impractical at scale:

  • Cost: The experiment cost $5,000, and the customer had 1,000 more queries. It is possible to make the search more efficient and reduce the number of iterations, but since customers have hundreds or thousands of queries, even a 10- or 50-fold reduction is not sufficient to make the approach economical.
  • Shadow clusters and tables: For ETL queries, the approach requires shadow clusters and tables. The queries had to be reviewed multiple times to make sure production clusters and tables were not affected, and the cost in man-hours was also exorbitant.

Model-Based Execution

Since iterative execution is impractical at scale, we considered a model-based approach that eliminates the need to execute queries. We created an execution model that replicates the execution engine. The model is based only on the reduced set of parameters and is therefore simpler than other approaches. It requires statistics about the workload, which can be obtained in two ways:

  1. Collect metrics from a previous run. This approach is suitable for ETL or reporting queries. In QDS, these metrics are available.
  2. Statistics from the database catalog. This approach is suitable for ad-hoc queries. In QDS, customers can collect these statistics by turning on Automatic Statistics Collection.
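To illustrate how a model can replace execution, here is a toy cost model that predicts task counts and wall-clock time from previously collected metrics. All per-task costs, the wave-based parallelism assumption, and the parameter names are hypothetical, not Qubole's actual model:

```python
import math

def predict_runtime(input_bytes, shuffle_bytes, split_mb,
                    bytes_per_reducer_mb, map_cost_s=2.0,
                    reduce_cost_s=3.0, slots=50):
    """Toy model: derive task counts from data sizes and configuration,
    then estimate runtime assuming `slots` tasks run per parallel wave.
    The per-task costs and wave model are illustrative assumptions."""
    mappers = math.ceil(input_bytes / (split_mb * 1024 * 1024))
    reducers = max(1, math.ceil(
        shuffle_bytes / (bytes_per_reducer_mb * 1024 * 1024)))
    map_waves = math.ceil(mappers / slots)
    reduce_waves = math.ceil(reducers / slots)
    return map_waves * map_cost_s + reduce_waves * reduce_cost_s

# Compare two split sizes for a 100 GiB scan with a 10 GiB shuffle,
# without ever running the query on a cluster:
small_splits = predict_runtime(100 << 30, 10 << 30,
                               split_mb=128, bytes_per_reducer_mb=256)
large_splits = predict_runtime(100 << 30, 10 << 30,
                               split_mb=512, bytes_per_reducer_mb=256)
```

Searching the discretized parameter grid with such a model costs microseconds per candidate, which is what makes model-based search economical where iterative execution is not.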

Experimental Results:

To quantify the prediction error of the model, we ran an experiment on 4 queries of a customer. The graph below shows the benefit predicted by our model and the actual observed benefit for these queries. The actual savings closely match the predicted savings, indicating that the model is sufficiently accurate.
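Such an accuracy check boils down to a relative error between predicted and observed benefit per query. A small sketch with made-up numbers (the real per-query figures are in the graph, not reproduced here):

```python
def relative_errors(predicted, actual):
    """Per-query relative prediction error, skipping zero actuals."""
    return [abs(p - a) / a for p, a in zip(predicted, actual) if a != 0]

# Hypothetical per-query savings fractions, for illustration only.
predicted_benefit = [0.40, 0.55, 0.62, 0.30]
actual_benefit = [0.38, 0.57, 0.60, 0.33]
max_error = max(relative_errors(predicted_benefit, actual_benefit))
```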

Key Insights to optimize workloads

We gained a few key insights, listed below in order of priority, for optimizing SQL workloads through multiple experiments and trials on customer queries.

Stay Tuned

This automated discovery of insights, using the simple cost model for SQL workloads and the data collected through the automatic statistics collector, will also be implemented for non-SQL workloads such as data science and machine learning.

  1. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B. and Babu, S., 2011. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, Vol. 11, No. 2011, pp. 261–272.
  2. Yeh, C.-C., Zhou, J., Chang, S.-A., Lin, X.-Y., Sun, Y. and Huang, S.-K., 2016. BigExplorer: A Configuration Recommendation System for Big Data Platform. In Technologies and Applications of Artificial Intelligence (TAAI) 2016, pp. 228–234. ISSN 2376-6824.
  3. Kumar, S., Padakandla, S., Chandrashekar, L., Parihar, P., Gopinath, K. and Bhatnagar, S. Performance Tuning of Hadoop MapReduce: A Noisy Gradient Approach.


