跳到主要内容
版本:0.2.x

查询优化器

优化器是数据库系统的核心,优秀的优化器能极大提高查询性能,特别是在复杂查询场景下优化器能带来数倍至数百倍的性能提升。

ByConity 自研优化器基于四个大的优化方向提供极致优化能力:

  • RBO:基于规则的优化能力。支持:列裁剪、分区裁剪、表达式简化、子查询解关联、谓词下推、冗余算子消除、Outer-JOIN 转 INNER-JOIN、算子下推存储、分布式算子拆分等常见的启发式优化能力。
  • CBO:基于代价的优化能力。支持:Join Reorder、Outer-Join Reorder、Join/Agg Reorder、CTE、物化视图、Dynamic Filter 下推、Magic Set 等基于代价的优化能力。并且面向分布式计划融合了 Property Enforcement。
  • DBO:基于数据依赖的优化能力。支持:唯一键、functional dependency、Order dependency、Inclusion dependency 等基于数据依赖关系的优化能力。

Turn on Query Optimizer

set dialect_type ='ANSI' or set enable_optimizer =1

If turn on ANSI mode, the optimizer will be turned on automatically.

Commands for collecting, displaying and deleting statistical information

Syntax Definition

  • Subsequent versions will have automatic collection, update and analysis of statistical information
-- NOTE: ALL means currentDatabase().*

-- specifying columns and sample config is supportted
CREATE STATS [IF NOT EXISTS] (<db>.* | ALL | [<db>.]<table>['(' <col1>, <col2>, ..., <coln> ')']) [WITH (FULLSCAN | SAMPLE [<rows> ROWS] [<ratio> RATIO])]

-- Delete statistics
-- specifying columns is supported
DROP STATS (<db>.* | ALL | [<db>.]<table>['(' <col1>, <col2>, ..., <coln> ')' ])

-- specifying columns is supported
SHOW (STATS|COLUMN_STATS) (<db>.* | ALL | [<db>.]<table>'('<col1>, <col2>, ..., <coln> ')' )

Syntax Example:

Create Statistics


-- create stats for all table in current database which has no statistics
create stats if not exists all;
-- create stats for all table in db1 which has no statistics
create stats if not exists db1.*;

-- create stats for t1.c1,
create stats t1(c1);
-- create stats for t1.c1, t1.c2,
create stats t1(c1, c2);

-- create stats on t1, force to use fullscan,
-- otherwise controlled by statistics_enable_sample
create stats t1 with fullscan
-- sample MAX(**statistics_sample_row_count**, 0.01 * full_rows_count) rows when create stats for each table
create stats all with 0.01 ratio;
-- sample MAX(40000000, 0.01 * full_rows_count) rows when create stats for each table
create stats all with 0.01 ratio 40000000 rows;
-- sample with default settings,
-- controlled by **statistics_sample_row_count** and **statistics_sample_ratio
create stats all with sample;

Show Statistics


-- show table statistics for all tables in current database
show stats all;
-- show table statistics for all tables in db1
show stats db1.*;
-- show table statistics for table t1
show stats t1;
-- show column statistics for table t1
show column_stats t1;

-- show column statistics for column c1, c2 of table t1
show column_stats t1(c1, c2);

Drop Statistics


-- show statistics for all tables in current database
drop stats all;
-- show statistics for all tables in db1.*
drop stats db1.*;
-- show statistics for table t1
drop stats t1;


-- drop statistics for column c1, c2 of table t1
drop stats t1(c1, c2);

Usage of Explain

Explain

  • Explain + SQL
EXPLAIN
WITH
(
SELECT
sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns
, date_dim
WHERE (sr_returned_date_sk = d_date_sk) AND (d_year = 2000)
GROUP BY
sr_customer_sk,
sr_store_sk
) AS customer_total_return
SELECT c_customer_id
FROM customer_total_return AS ctr1
, store
, customer
WHERE (ctr1.ctr_total_return >
(
SELECT avg(ctr_total_return) * 1.2
FROM customer_total_return AS ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)) AND (s_store_sk = ctr1.ctr_store_sk) AND (s_state = 'TN') AND (ctr1.ctr_customer_sk = c_customer_sk)
ORDER BY c_customer_id ASC
LIMIT 100

  • Result of Explain

Explain Analyze

Explain analyze + SQL

EXPLAIN ANALYZE LOGICAL
WITH customer_total_return AS

(
SELECT
sr_customer_sk AS ctr_customer_sk,
sr_store_sk AS ctr_store_sk,
sum(sr_return_amt) AS ctr_total_return
FROM store_returns
, date_dim
WHERE (sr_returned_date_sk = d_date_sk) AND (d_year = 2000)
GROUP BY
sr_customer_sk,
sr_store_sk
)
SELECT c_customer_id
FROM customer_total_return AS ctr1
, store
, customer
WHERE (ctr1.ctr_total_return >
(
SELECT avg(ctr_total_return) * 1.2
FROM customer_total_return AS ctr2
WHERE ctr1.ctr_store_sk = ctr2.ctr_store_sk
)) AND (s_store_sk = ctr1.ctr_store_sk) AND (s_state = 'TN') AND (ctr1.ctr_customer_sk = c_customer_sk)
ORDER BY c_customer_id ASC
LIMIT 100

┌─Explain Analyze──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Limit │
│ │ Est. 100 rows, cost 8.2e+05 │
│ │ Act. 100 rows (2.4 KB), Throughput: 545.0K rows/s, CPU Time per worker: avg 2.0 us, max 2.0 us, Wall Time: 1.8 ms │
│ │ Filtered: 0.0%, QError: 1.0 │
│ └─ MergeSorting │
│ │ Est. 49.2K rows, cost 8.2e+05 │
│ │ Act. 100 rows (2.4 KB), Throughput: 545.0K rows/s, CPU Time per worker: avg 2.0 us, max 2.0 us, Wall Time: 1.8 ms │
│ │ Filtered: 96.9%, QError: 504.1 │
│ └─ Union │
│ │ Est. 49.2K rows, cost 8.2e+05 │
│ │ Act. 3.1K rows (78.1 KB), Throughput: 17.0M rows/s, CPU Time per worker: avg 0.0 us, max 0.0 us, Wall Time: 1.8 ms │
│ │ Filtered: 0.0%, QError: 15.8 │
│ └─ PartialSorting │
│ │ Est. 49.2K rows, cost 8.2e+05 │
│ │ Act. 3.1K rows (78.1 KB), Throughput: 17.0M rows/s, CPU Time per worker: avg 40.0 us, max 40.0 us, Wall Time: 1.8 ms │
│ │ Filtered: 0.0%, QError: 15.8 │
│ └─ Gather Exchange │
│ │ Est. 49.2K rows, cost 8.2e+05 │
│ └─ MergeSorting │
│ │ Est. 49.2K rows, cost 8.1e+05 │
│ │ Act. 12.5K rows (313.1 KB), Throughput: 108.2M rows/s, CPU Time per worker: avg 23.0 us, max 25.0 us, Wall Time: 1.2 ms │
│ │ Filtered: -1.6%, QError: 3.9 │
│ └─ PartialSorting │
│ │ Est. 49.2K rows, cost 8.1e+05 │
│ │ Act. 12.3K rows (308.3 KB), Throughput: 91.6M rows/s, CPU Time per worker: avg 40.0 us, max 40.0 us, Wall Time: 1.3 ms │
│ │ Filtered: 0.0%, QError: 4.0 │
│ └─ Inner Join │
│ │ Est. 47.5K rows, cost 8.1e+05 │
│ │ Act. 12.3K rows (308.3 KB), Throughput: 91.6M rows/s, CPU Time per worker: avg 41.6 ms, max 41.8 ms, Wall Time: 1.3 ms │
│ │ Filtered: 0.0%, QError: 3.9 │
│ │ Condition: sr_store_sk_1 == s_store_sk │
│ │ Filter: `sum(sr_return_amt)_1` > (`avg(ctr_total_return)` * 1.2) │
│ ├─ Local Exchange │
│ │ │ Est. 49.2K rows, cost 2.9e+05 │
│ │ └─ Inner Join │
│ │ │ Est. 49.2K rows, cost 2.9e+05 │
│ │ │ Act. 49.3K rows (2.1 MB), Throughput: 542.3M rows/s, CPU Time per worker: avg 6.2 ms, max 6.4 ms, Wall Time: 896.0 us │
│ │ │ Filtered: 0.0%, QError: 1.0 │
│ │ │ Condition: c_customer_sk == sr_customer_sk_1 │
│ │ ├─ Repartition Exchange │
│ │ │ │ Est. 97.7K rows, cost 9.0e+04 │
│ │ │ │ Partition by: {c_customer_sk} │
│ │ │ └─ ReadFromStorage tpcds.customer │
│ │ │ Est. 97.7K rows, cost 7.4e+04 │
│ │ │ Act. 97.7K rows (3.1 MB), Throughput: 22.2G rows/s, CPU Time per worker: avg 362.0 us, max 402.0 us, Wall Time: 47.0 us │
│ │ │ Filtered: 0.0%, QError: 1.0 │
│ │ └─ Repartition Exchange │
│ │ │ Est. 50.9K rows, cost 6.1e+04 │
│ │ │ Partition by: {sr_customer_sk_1} │
│ │ └─ CTERef [0] │
│ │ Est. 50.9K rows, cost 5.2e+04 │
│ └─ Broadcast Exchange │
│ │ Est. 7 rows, cost 5.6e+04 │
│ └─ Inner Join │
│ │ Est. 7 rows, cost 5.6e+04 │
│ │ Act. 6 rows (102.0 Bytes), Throughput: 67.7K rows/s, CPU Time per worker: avg 280.0 us, max 360.0 us, Wall Time: 870.0 us │
│ │ Filtered: 97.9%, QError: 1.2 │
│ │ Condition: s_store_sk == sr_store_sk_2 │
│ ├─ Repartition Exchange │
│ │ │ Est. 12 rows, cost 1.2e+01 │
│ │ │ Partition by: {s_store_sk} │
│ │ └─ Filter │
│ │ │ Est. 12 rows, cost 9.8e+00 │
│ │ │ Act. 12 rows (240.0 Bytes), Throughput: 28.6M rows/s, CPU Time per worker: avg 16.0 us, max 24.0 us, Wall Time: 5.0 us │
│ │ │ Filtered: 0.0%, QError: 1.0 │
│ │ │ s_state = 'TN' │
│ │ └─ ReadFromStorage tpcds.store │
│ │ Est. 12 rows, cost 8.9e+00 │
│ │ Act. 12 rows (240.0 Bytes), Throughput: 28.6M rows/s, CPU Time per worker: avg 48.0 us, max 72.0 us, Wall Time: 4.0 us │
│ │ Filtered: 0.0%, QError: 1.0 │
│ └─ MergingAggregated │
│ │ Est. 6 rows, cost 5.6e+04 │
│ │ Act. 280 rows (4.9 KB), Throughput: 3.1M rows/s, CPU Time per worker: avg 60.0 us, max 80.0 us, Wall Time: 867.0 us │
│ │ Filtered: 0.0%, QError: 46.7 │
│ └─ Repartition Exchange │
│ │ Est. 6 rows, cost 5.6e+04 │
│ │ Partition by: {sr_store_sk_2} │
│ └─ Aggregating │
│ │ Est. 6 rows, cost 5.6e+04 │
│ │ Act. 14 rows (238.0 Bytes), Throughput: 145.6K rows/s, CPU Time per worker: avg 7.0 us, max 9.0 us, Wall Time: 955.0 us │
│ │ Filtered: 100.0%, QError: 2.3 │
│ │ Group by: {sr_store_sk_2} │
│ └─ Projection │
│ │ Est. 50.9K rows, cost 5.6e+04 │
│ │ Act. 49.3K rows (887.4 KB), Throughput: 519.5M rows/s, CPU Time per worker: avg 0.0 us, max 0.0 us, Wall Time: 947.0 us │
│ │ Filtered: 0.0%, QError: 1.0 │
│ └─ CTERef [0] │
│ Est. 50.9K rows, cost 5.2e+04 │
│ CTEDef [0] │
│ MergingAggregated │
│ │ Est. 50.9K rows, cost 4.3e+05 │
│ │ Act. 49.3K rows (1.3 MB), Throughput: 497.5M rows/s, CPU Time per worker: avg 34.0 us, max 35.0 us, Wall Time: 996.0 us │
│ │ Filtered: 0.0%, QError: 1.0 │
│ └─ Repartition Exchange │
│ │ Est. 50.9K rows, cost 4.3e+05 │
│ │ Partition by: {sr_customer_sk, sr_store_sk} │
│ └─ Aggregating │
│ │ Est. 50.9K rows, cost 4.2e+05 │
│ │ Act. 51.4K rows (1.3 MB), Throughput: 1.1G rows/s, CPU Time per worker: avg 27.0 us, max 28.0 us, Wall Time: 506.0 us │
│ │ Filtered: 5.8%, QError: 1.0 │
│ │ Group by: {sr_customer_sk, sr_store_sk} │
│ └─ Inner Join │
│ │ Est. 50.9K rows, cost 4.2e+05 │
│ │ Act. 54.5K rows (1.4 MB), Throughput: 1.2G rows/s, CPU Time per worker: avg 14.3 ms, max 15.0 ms, Wall Time: 477.0 us │
│ │ Filtered: -90.6%, QError: 1.1 │
│ │ Condition: sr_returned_date_sk == d_date_sk │
│ ├─ Local Exchange │
│ │ │ Est. 280.8K rows, cost 2.3e+05 │
│ │ └─ Filter │
│ │ │ Est. 280.8K rows, cost 2.3e+05 │
│ │ │ Act. 54.5K rows (1.9 MB), Throughput: 5.7G rows/s, CPU Time per worker: avg 865.0 us, max 885.0 us, Wall Time: 96.0 us │
│ │ │ Filtered: 0.0%, QError: 5.2 │
│ │ │ Dynamic Filters: {sr_returned_date_sk:range} │
│ │ └─ ReadFromStorage tpcds.store_returns │
│ │ Est. 280.8K rows, cost 2.1e+05 │
│ │ Act. 54.5K rows (1.9 MB), Throughput: 5.8G rows/s, CPU Time per worker: avg 2.4 ms, max 2.4 ms, Wall Time: 95.0 us │
│ │ Filtered: 0.0%, QError: 5.2 │
│ └─ Projection │
│ │ Est. 366 rows, cost 6.0e+04 │
│ │ Act. 28.6K rows (228.8 KB), Throughput: 45.4G rows/s, CPU Time per worker: avg 1.1 ms, max 1.3 ms, Wall Time: 7.0 us │
│ │ Filtered: 0.0%, QError: 80.0 │
│ │ Dynamic Filters Builder: {d_date_sk} │
│ └─ Broadcast Exchange │
│ │ Est. 366 rows, cost 6.0e+04 │
│ └─ Filter │
│ │ Est. 366 rows, cost 5.9e+04 │
│ │ Act. 366 rows (6.1 KB), Throughput: 85.1M rows/s, CPU Time per worker: avg 282.0 us, max 283.0 us, Wall Time: 44.0 us │
│ │ Filtered: 0.0%, QError: 1.0 │
│ │ d_year = 2000 │
│ └─ ReadFromStorage tpcds.date_dim │
│ Est. 71.3K rows, cost 5.4e+04 │
│ Act. 366 rows (6.1 KB), Throughput: 85.1M rows/s, CPU Time per worker: avg 780.0 us, max 840.0 us, Wall Time: 43.0 us │
│ Filtered: 0.0%, QError: 199.6 │
│ note: Dynamic Filter is applied for 1 times. │
│ note: CTE(Common Table Expression) is applied for 2 times. │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Explain analyze distributed + SQL

EXPLAIN ANALYZE DISTRIBUTED
SELECT
i_item_id,
avg(ss_quantity) AS agg1,
avg(ss_list_price) AS agg2,
avg(ss_coupon_amt) AS agg3,
avg(ss_sales_price) AS agg4
FROM store_sales
, customer_demographics
, date_dim
, item
, promotion
WHERE (ss_sold_date_sk = d_date_sk) AND (ss_item_sk = i_item_sk) AND (ss_cdemo_sk = cd_demo_sk) AND (ss_promo_sk = p_promo_sk) AND (cd_gender = 'M') AND (cd_marital_status = 'S') AND (cd_education_status = 'College') AND ((p_channel_email
= 'N') OR (p_channel_event = 'N')) AND (d_year = 2000)
GROUP BY i_item_id
ORDER BY i_item_id ASC
LIMIT 100

  • QError: max(Est rows, Act rows) / min(Est rows, Act rows), 1 is the best.
  • Wall Time: max wall time of all the workers. We can figure out which step takes the most time.
  • CPU Time: We can figure out which step has the data skew issue. (If the max time is bigger than avg time, there is one worker which processed too much data.)