Background Tasks Configuration
Common commands
Start/Stop Merge of tables
-- Start merge
system start merges xxx.yyy;
-- Stop merge
system stop merges xxx.yyy;
Manually trigger merge
-- Trigger a merge schedule (same as the automatic process above)
optimize table xxx.yyy;
-- Trigger a merge schedule under the specified partition
optimize table xxx.yyy partition 'zzz';
In addition, there can also be a dry run mode, which is convenient for checking which parts can be selected in the next merge (or why the part cannot be selected)
-- Open tracelog, when executing try optimize
-- will print the relevant logs of the entire merge select process,
-- For example, which partitions are selected, which parts, which worker the merge task is sent to, etc.
set send_logs_level = 'trace';
try optimize table xxx.yyy;
try optimize table xxx.yyy partition 'zzz';
Modify the merge parameters
alter table xxx.yyy modify setting cnch_merge_zzz = www;
Observation
Note the host server of the selection table.
system.bg_threads can view the running status of the Merge/GC background thread of the table.
system.manipulations can view the currently running Merge tasks.
system.server_part_log can view the information of historical Merge tasks. (You can also see this table for the part of the new insert).
Settings
Configuration Items | Default Value, Meaning | Remarks |
---|---|---|
////// Multiple partitions ////// | Only select one partition each time at the earliest, and the concurrency is not enough. Changing multiple partitions is expected to improve concurrency. | |
enable_addition_bg_task | false, allow more merge tasks to run at the same time | used together with max_addition_bg_task_num. |
max_addition_bg_task_num | 200, limit the maximum number of concurrent tasks | |
max_partition_for_multi_select | 3, how many partitions to select for merge | merge will select partitions first, and then select range from partitions. If there are multiple partition keys (such as appid, date), this value usually needs to be increased because there are many partitions. |
////// Increase the speed of each range selection ////// | When the amount of data is large, each selection is very time-consuming, and it is expected to be accelerated by batch. | |
cnch_merge_enable_batch_select | false, whether to enable batch_select | batch_select will select a large number of ranges every round, and then submit these ranges directly in subsequent rounds, which can shorten the trigger cycle of merge, so that the total number of tasks can truly reach the upper limit of concurrency. |
cnch_merge_max_total_rows_to_merge | 15000000, the maximum number of rows for a single merged part | It is more useful when the business is very clear about the number of level0 part rows and expects to control the overall number after merge (affecting query time). |
////// Real-time data optimization ////// | Real-time scenario | |
cnch_merge_only_realtime_partition | false, whether to select only realtime partition | When selecting partitions for each round, N partitions with more than count and N realtime partitions will be selected. If this setting is enabled, partitions will only be selected according to the realtime level (min_block). A live table scenario would be useful. |
max_bytes_to_merge_at_max_space_in_pool | 150 1024 1024 * 1024, which is 150GB. The maximum bytes limit of a single merge task. | The parameter name does not seem intuitive enough, the legacy code problem... This parameter in cnch is used to set the max size of the merge selector. The function is similar to cnch_merge_max_total_rows_to_merge, one limits the number of rows, and the other limits the number of bytes. |
cnch_vw_write | vw_default, vw for executing merge | Consult sre to know which vw should be used as write vw for your business |
cnch_merge_pick_worker_algo | 'RM', the algorithm when selecting workers, pick from RM by default. | In a high concurrency usage, it needs to be set to 'RoundRobin' |
Instructions for use
- If the size of the table is not large (on the order of millions), but there are many parts currently, and you want to speed up the merge frequency, usually enable_addition_bg_task = 1 max_partition_for_multi_select will have an effect. At this time, the corresponding max_addition_bg_task_num is the upper limit of merge concurrency.
- If the table size is very large (tens of millions, hundreds of millions), (it will take a long time to get parts from the catalog each time), at this time you may not only need to enable_addition_bg_task, but also need to use cnch_merge_enable_batch_select = 1 to increase the number of each merge select result.
- About cnch_merge_only_realtime_partition and max_partition_for_multi_select. Usually kafka real-time tables use cnch_merge_only_realtime_partition, which means that only the latest min_block is selected each time a partition is selected, and the number of partitions is controlled by max_partition_for_multi_select. Note: If it is an offline table and a large number of new partitions are generated every day, if cnch_merge_only_realtime_partition is turned on at this time and max_partition_for_multi_select is set very small, there will be a situation where some partitions cannot participate in the merge on the same day.