SparkSQL Query Optimizer


A SparkSQL Query Optimizer is a SQL query optimizer used by the Spark SQL engine, such as the Catalyst optimizer and its cost-based optimization framework.



References

2017

  • Ron Hu, Zhenhua Wang, Wenchen Fan, and Sameer Agarwal. (2017). “Cost Based Optimizer in Apache Spark 2.2.” In: Databricks Engineering Blog, August 31, 2017.
    • QUOTE: Apache Spark 2.2 recently shipped with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, average/max length, etc.) to improve the quality of query execution plans. Leveraging these statistics helps Spark to make better decisions in picking the most optimal query plan. Examples of these optimizations include selecting the correct build side in a hash-join, choosing the right join type (broadcast hash-join vs. shuffled hash-join) or adjusting a multi-way join order, among others.

      In this blog, we’ll take a deep dive into Spark’s Cost Based Optimizer (CBO) and discuss how Spark collects and stores these statistics, optimizes queries, and show its performance impact on TPC-DS benchmark queries.
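      The workflow the post describes can be sketched in a few lines of Spark code. The following is a minimal, hedged illustration (the table name sales and the columns item_id and price are hypothetical placeholders), assuming Spark 2.2+ where the CBO configuration keys and ANALYZE TABLE statements shown here exist:

      <pre>
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cbo-sketch").getOrCreate()

// Enable the cost-based optimizer and CBO-driven join reordering
// (both disabled by default in Spark 2.2).
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Collect table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Collect per-column statistics (distinct values, NULL counts,
// max/min, average/max length) that the CBO consumes when costing plans.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS item_id, price")

// Inspect the collected column statistics (this DESCRIBE form is
// available in later Spark releases, 2.3+).
spark.sql("DESCRIBE EXTENDED sales item_id").show(truncate = false)
      </pre>

      With these statistics in place, subsequent queries over the analyzed tables are planned with cardinality estimates rather than raw file sizes, which is what lets Spark pick a hash-join build side or a broadcast join as the quote describes.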

2015

  • (2015). “Deep Dive into Spark SQL’s Catalyst Optimizer.” In: Databricks Engineering Blog, April 13, 2015. https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
    • QUOTE: Spark SQL is one of the newest and most technically involved components of Spark. It powers both SQL queries and the new DataFrame API. At the core of Spark SQL is the [[SparkSQL Query Optimizer|Catalyst optimizer]], which leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) in a novel way to build an extensible query optimizer.

      We recently published a paper on Spark SQL that will appear in SIGMOD 2015 (co-authored with Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, and Ali Ghodsi). In this blog post we are republishing a section in the paper that explains the internals of the Catalyst optimizer for broader consumption.
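      The extensibility the quote attributes to Catalyst rests on optimizer rules written as Scala pattern matches over expression trees. The sketch below is an illustrative, hypothetical rule (not one of Catalyst's built-in rules), assuming the Spark 2.x Catalyst API in which Multiply has exactly two child expressions; it rewrites x * 1 to x:

      <pre>
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Hypothetical example rule: simplify multiplication by the integer literal 1.
object SimplifyMultiplyByOne extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    // Pattern-match on the expression tree, as the quote describes.
    case Multiply(left, Literal(1, IntegerType))  => left
    case Multiply(Literal(1, IntegerType), right) => right
  }
}

// Inject the rule into an existing SparkSession's optimizer
// via the experimental extension point.
// spark.experimental.extraOptimizations = Seq(SimplifyMultiplyByOne)
      </pre>

      Because each rule is just a partial function over the tree, third parties can add optimizations without touching Catalyst's core, which is the extensibility argument the SIGMOD 2015 paper develops in detail.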