Version: 0.2.0

Import optimisation

Supported import methods

Currently, CNCH supports the following import methods (example statements follow the list):

  • INSERT INFILE: suitable when the data files have already been generated. For small amounts of data, the remote/local file can be read directly and imported into CNCH; for large amounts of data (more than a few GB), PartWriter and Attach are recommended instead.
  • INSERT VALUES: suitable for temporarily inserting small amounts of data for testing.
  • INSERT SELECT: suitable when the result of a query on a table needs to be saved for subsequent queries.
  • Dumper & Attach: suitable for migrating from the local version to CNCH. Dumper writes the data of an entire cluster to HDFS, and Attach then moves that data into CNCH so it can be queried.
  • PartWriter & Attach: suitable for scenarios with a large amount of data spread across many files. A Spark (or similar) job can use PartWriter to generate new Part files from the original files and write them to HDFS, after which Attach moves the data into CNCH for querying.
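As a quick orientation, the sketch below shows roughly what the first three methods look like as SQL statements. The table demo.events, its columns, and the HDFS path are placeholders, and the exact clause order of INSERT ... INFILE may differ between CNCH versions, so treat this as an illustration rather than a syntax reference.

```sql
-- INSERT VALUES: small, ad-hoc test data.
INSERT INTO demo.events (event_date, user_id, payload)
VALUES ('2024-01-01', 1, 'hello'), ('2024-01-01', 2, 'world');

-- INSERT SELECT: persist the result of a query for later use.
INSERT INTO demo.events_daily
SELECT event_date, count() AS cnt
FROM demo.events
GROUP BY event_date;

-- INSERT INFILE: load an already generated data file (the clause order shown
-- here is an assumption; check the INSERT reference for your CNCH version).
INSERT INTO demo.events FORMAT CSV INFILE 'hdfs://namenode:9000/warehouse/events/part-0.csv';
```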

Optimisation methods

Direct write mode tuning

When using INSERT VALUES, INSERT INFILE, or the PartWriter tool to write data, the number of Parts generated at the end affects how many times data is written to HDFS, which in turn affects the overall write time. It is therefore recommended to keep the number of generated Parts as small as possible. The direct write process is as follows:

  • Read part of the file data
  • Split this data according to PartitionBy
  • Split it further according to ClusterBy
  • Write the split data into new Parts and upload them to HDFS
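The partition and bucket referred to in the splitting steps come from the table definition. The sketch below shows a hypothetical CnchMergeTree table named demo.events; the exact CLUSTER BY syntax (including the optional SPLIT_NUMBER and WITH_RANGE modifiers) should be checked against your CNCH version.

```sql
-- PartitionBy corresponds to PARTITION BY; ClusterBy (bucketing) corresponds
-- to CLUSTER BY ... INTO ... BUCKETS.
CREATE TABLE demo.events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = CnchMergeTree
PARTITION BY toYYYYMM(event_date)
CLUSTER BY user_id INTO 8 BUCKETS
ORDER BY user_id;
```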

Optimisation methods:

  1. To reduce the number of Parts, data with the same partition and bucket should be arranged together in the file, so that each batch of newly read data produces as few Parts as possible. In practice, the data can be sorted so that rows with the same partition, and within a partition the same bucket, are adjacent (see the sketch after this list). The bucket of a row is calculated as follows:
  • If SPLIT_NUMBER is not specified, SipHash is calculated over the ClusterBy key columns and taken modulo the bucket count to obtain the BucketNumber.
  • If SPLIT_NUMBER is specified:
    • Calculate the SplitValue:
      • If ClusterBy has a single column, use the dtspartition function to calculate the corresponding SplitValue
      • If ClusterBy has multiple columns, use SipHash over these columns to calculate the corresponding SplitValue
    • Calculate the BucketNumber:
      • If WithRange is used, BucketNumber = SplitValue * BucketCount / SplitNumber
      • If WithRange is not used, BucketNumber = SplitValue % BucketCount
  2. When reading files:
    • If each row of data is not large, read a larger block at a time by increasing max_insert_block_size, thereby generating larger Parts.
    • If the files being read are not on HDFS/CFS and multiple files are matched using wildcards, also increase min_insert_block_size_rows and min_insert_block_size_bytes at the same time.
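Both points can be combined in practice: order the source data by partition key and then by ClusterBy key, and enlarge the insert block settings. The sketch below reuses the hypothetical demo.events table and assumes a staging table demo.events_staging; the setting values are arbitrary examples, not recommendations.

```sql
-- Arrange rows with the same partition and bucket together by sorting on the
-- partition key first and the ClusterBy key second (here via INSERT SELECT;
-- the same ordering applies when generating input files for INSERT INFILE).
INSERT INTO demo.events
SELECT *
FROM demo.events_staging
ORDER BY toYYYYMM(event_date), user_id;

-- Read larger blocks per insert so fewer, larger Parts are produced.
SET max_insert_block_size = 4194304;          -- rows read per block from the input
SET min_insert_block_size_rows = 4194304;     -- squash small blocks up to this many rows
SET min_insert_block_size_bytes = 1073741824; -- ...or up to this many bytes
```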

Dumper mode optimisation

During a Dumper migration, you can specify a table and some of its partitions at a time and write them to HDFS. The time this takes depends on the size of the table's data and the number of Parts it contains. To upload Parts with higher concurrency, use the parallel parameter. It also helps to wait for the local Parts to be merged before uploading them with Dumper, so that fewer Parts need to be transferred (see the sketch below).
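For the last point, standard ClickHouse statements can be used on the local cluster to encourage merges and to check how many Parts remain before running Dumper; local_db.events is a placeholder, and the Dumper invocation itself is not shown because its parameters depend on the deployment.

```sql
-- Force pending merges so fewer Parts need to be uploaded.
OPTIMIZE TABLE local_db.events FINAL;

-- Inspect the number of active Parts per partition before dumping.
SELECT partition, count() AS part_count
FROM system.parts
WHERE database = 'local_db' AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;
```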