Version: 0.2.0

Bucket table Description

In ByConity, when using a Bucket table, the system organizes table data based on one or more columns and expressions specified by the user in the table creation statement. The data with the same value is clustered together and assigned to the same bucket number.

Benefits of using Bucket table

The following benefits can be obtained on large tables by using cluster key to aggregate data:

Cluster key point checks can filter out most of the data, reducing the amount of IO needed to obtain shorter execution time and higher concurrent QPS.
For aggregation calculations of the cluster key, computing nodes can pre-calculate subsets of data to achieve smaller memory usage and shorter execution times.
Joining two or more tables for the cluster key can be optimized through co-located joins, which greatly reduce the amount of shuffle data and result in shorter execution times.

When to consider using Bucket table

The table is sufficiently large, which means that the number of partitions under a partition needs to be significantly higher than the number of workers.
Query statements can take advantage of these benefits.

How to choose a cluster key

The cluster key can consist of one or more columns and expressions. It is recommended to use up to 3 fields. Using more fields typically results in higher write costs and provides smaller benefits to query statements.The Cluster key can be one or more columns and expressions. It is recommended to use up to 3 fields. More fields usually introduce high write costs and the scope of benefit statements is smaller.

Choosing the correct cluster key is crucial for optimizing performance, so it should be selected carefully. Generally, the following principles can be followed:

Columns that are often used for equality and IN filtering
Commonly used aggregation columns
Multi-table join key

If the above scenario is commonly used in combination of two columns, such as a = 1 and b = 2, then cluster key can get better results by selecting two columns.

Another dimension to consider is the number of distinct values for a column, given that

The data of the same cluster key will belong to the same bucket number
All parts under a bucket number will be sent to the same worker for calculation

We can conclude that

To utilize all worker nodes for computation, the number of distinct values needs to exceed the number of workers.
If the number of distinct values is small, it is preferable to select the largest distinct value, preferably a multiple of the number of workers, to achieve a more balanced data distribution during querying.

For example:

In the table test, the commonly used filter columns are c1, c2, c3, and the columns are independent of each other
The number of workers is 30
distinct c1 is 6
distinct c2 is 8
distinct c3 is 5

It can be observed that the distinct values of each column are all smaller than the number of workers. The number of distinct values for c1 and c2 is 48, which is greater than the number of workers, but it is not a multiple of the number of workers, making it unsuitable for a cluster key. The number of distinct values for c1 and c3 is 30, which is exactly one times the number of workers; however, for a more balanced data distribution, it is better to choose a larger value of distinct c1, c2, and c3, such as 8 times the number of workers.

How to choose bucket number

Given that within a partition

Each row of data will be calculated according to the value of the cluster key by % BUCKETS to get the corresponding bucket number
Parts of the same bucket number will be sent to the same worker node for calculation during query

Therefore, choosing an appropriate bucket number has a major impact on storage and query, and generally has the following principles:

Ensure that the bucket number is a multiple of the number of workers to achieve balanced allocation during querying. It is generally recommended to set it to 1 or 2 times the number of workers (with room for expanding worker node redundancy) if there are enough worker nodes.
Ensure that there is enough data in a bucket number under a partition and avoid generating overly small parts. For a relatively small table, ensure that the size of a bucket number part in a partition exceeds 1GB. Do not set the bucket number too high, as a bucket number smaller than the number of workers may be generated.

How to decide whether to modify the cluster by definition

During operation, due to data changes, query mode changes, and changes in the number of worker nodes, users may want to reset the cluster key and bucket number.

Here you need to consider the cost of implementing the modification, and weigh whether you need to modify and when:

To modify the definition of cluster by, re-clustering on the existing data is required. It is necessary to assess the amount of existing data to estimate the execution time of the re-clustering process.
During the re-clustering period, querying existing data will revert to a normal table, and all the benefits of the bucket table will be temporarily lost.
Re-clustering will occupy the resources of the write worker. It is necessary to evaluate whether the current cnch cluster has an independent write worker, the current load, and the impact on existing tasks such as querying and merging.

There are two cases of modifying the cluster by definition:

Modifying the cluster key:

At this point, the current data can no longer benefit from the bucket table, so the lost income during re-clustering need not be considered.
It is necessary to evaluate the impact of the re-clustering task on existing tasks to determine whether it can be executed.

Modifying the bucket number:

The current bucket table still provides benefits, so it is necessary to confirm with the business side the acceptable performance rollback time, assess whether it can be executed based on the re-clustering time, and determine the start execution time.
It is also necessary to evaluate the impact of the re-clustering task on existing tasks to determine whether it can be executed.