Import optimisation
Supported import methods
Currently CNCH supports the following import methods:
- INSERT INFILE
  - Suitable when the data files have already been generated. For small amounts of data, the remote/local files can be read and imported into CNCH directly. For large amounts of data (more than a few GB), PartWriter and Attach are recommended instead.
- INSERT VALUES
  - Suitable for temporarily inserting a small amount of data for testing (a minimal client example follows this list).
- INSERT SELECT
  - Suitable when the result of a query on a table needs to be saved for subsequent queries.
- Dumper & Attach
  - Suitable for migrating from the local version to CNCH: Dumper writes the data of an entire cluster to HDFS, and Attach then moves that data into CNCH so it can be queried.
- PartWriter & Attach
  - Suitable for scenarios with a large data volume and a large number of files: a task such as a Spark job can use PartWriter to generate new Part files from the original files and write them to HDFS, and Attach then moves the data into CNCH for querying.
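The two simplest methods map directly to ordinary SQL statements. The sketch below is illustrative only and assumes a ClickHouse-compatible Python client (clickhouse-driver) can reach the CNCH cluster; the host, database, and table names are hypothetical.

```python
# Minimal sketch of INSERT VALUES and INSERT SELECT; host and table names
# are placeholders, not part of any CNCH deployment.
from clickhouse_driver import Client

client = Client(host="cnch-server.example.com")

# INSERT VALUES: a small batch of ad-hoc test rows.
client.execute(
    "INSERT INTO test.events (id, name) VALUES",
    [(1, "a"), (2, "b")],
)

# INSERT SELECT: persist the result of a query for later queries.
client.execute(
    "INSERT INTO test.events_by_name "
    "SELECT name, count() AS cnt FROM test.events GROUP BY name"
)
```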
Optimisation techniques
Direct write mode tuning
When writing data with INSERT VALUES, INSERT INFILE, or the PartWriter tool, the number of Parts generated determines how many times data is written to HDFS, which in turn affects the overall write time. It is therefore recommended to keep the number of generated Parts as small as possible. The direct write process is as follows (a sketch in code follows the list):
- Read a portion of the file data
- Split this data according to PartitionBy
- Further split it according to ClusterBy
- Write each resulting split into a new Part and upload it to HDFS
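The following Python sketch mirrors this flow under simplified assumptions. `partition_key`, `bucket_of`, and `write_part` are hypothetical stand-ins for the engine's PartitionBy, ClusterBy, and Part-writing logic rather than CNCH APIs.

```python
from collections import defaultdict

def write_chunk(rows, partition_key, bucket_of, write_part):
    """Sketch of the direct-write path: split one chunk of rows by
    partition, then by bucket, and turn each group into one new Part."""
    groups = defaultdict(list)
    for row in rows:
        groups[(partition_key(row), bucket_of(row))].append(row)

    # Every distinct (partition, bucket) pair seen in the chunk becomes one
    # Part on HDFS, so input that is pre-sorted by partition and bucket
    # yields fewer groups per chunk and therefore fewer Parts overall.
    for (partition, bucket), part_rows in groups.items():
        write_part(part_rows, partition, bucket)
```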
Optimisation suggestions:
- To reduce the number of Parts, rows with the same partition and bucket should be arranged together in the file, so that each chunk of newly read data produces as few Parts as possible. In practice, sort the data first by partition and then by bucket within each partition. The bucket is calculated as follows (see the sketch after this list):
  - If SPLIT_NUMBER is not specified, compute SipHash over the ClusterBy key columns and take the result modulo the bucket count to obtain the BucketNumber
  - If SPLIT_NUMBER is specified:
    - Calculate the SplitValue:
      - If ClusterBy is a single column, use the dtspartition function to calculate the corresponding SplitValue
      - If ClusterBy has multiple columns, use SipHash over these columns to calculate the SplitValue
    - Calculate the BucketNumber:
      - If WithRange, use SplitValue * BucketCount / SplitNumber as the BucketNumber
      - If not WithRange, use SplitValue % BucketCount as the BucketNumber
- When reading files:
  - If each row is small, increase max_insert_block_size to read a larger block at a time and therefore generate larger Parts
  - If the files being read are not on HDFS/CFS and wildcards match multiple files, also increase min_insert_block_size_rows and min_insert_block_size_bytes
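To make the arrangement requirement concrete, the bucket rules above can be summarised in a short sketch. `sip_hash_64` and `dts_partition` below are simplified stand-ins for CNCH's sipHash64 and dtspartition functions (their real implementations differ), and reducing the multi-column SplitValue into the range [0, SPLIT_NUMBER) is an assumption made only to keep the sketch self-contained.

```python
import hashlib

def sip_hash_64(values):
    # Stand-in for sipHash64: any stable 64-bit hash is enough for the sketch.
    digest = hashlib.sha256(repr(tuple(values)).encode()).digest()
    return int.from_bytes(digest[:8], "little")

def dts_partition(value, split_number):
    # Stand-in for dtspartition: maps a single ClusterBy value into
    # [0, SPLIT_NUMBER). The real function's mapping differs.
    return sip_hash_64([value]) % split_number

def bucket_number(cluster_by_values, bucket_count,
                  split_number=None, with_range=False):
    """Return the bucket a row falls into, following the rules above."""
    if split_number is None:
        # No SPLIT_NUMBER: hash the ClusterBy key, modulo the bucket count.
        return sip_hash_64(cluster_by_values) % bucket_count

    # SPLIT_NUMBER specified: first derive the SplitValue.
    if len(cluster_by_values) == 1:
        split_value = dts_partition(cluster_by_values[0], split_number)
    else:
        # Multiple ClusterBy columns: SipHash over all of them; the modulo
        # into [0, SPLIT_NUMBER) is an assumption of this sketch.
        split_value = sip_hash_64(cluster_by_values) % split_number

    # Then map the SplitValue onto a bucket (integer division for WithRange).
    if with_range:
        return split_value * bucket_count // split_number
    return split_value % bucket_count
```

Sorting the input rows by (partition value, bucket number computed this way) before generating Parts satisfies the arrangement requirement described in the first suggestion.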
Dumper mode optimisation
When migrating with Dumper, you can specify a table and some of its partitions at a time and write them to HDFS. The time this takes depends on the table's data size and its number of Parts. To increase the concurrency of Part uploads, use the parallel parameter; it also helps to wait for the Parts to be merged locally before uploading them with Dumper, so that there are fewer Parts to upload.