Import Tuning
Direct Write Method Tuning
When writing with `INSERT VALUES`, `INSERT INFILE`, or the `PartWriter` tool, the number of generated Parts directly determines the number of writes to HDFS, which in turn determines the overall write time. To optimize this process, the number of generated Parts should be minimized.
The direct write process is as follows:
- Read a portion of the file data.
- Split this data based on `PartitionBy`.
- Further split the data based on `ClusterBy`.
- Write the split data as new Parts and store them in HDFS.
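
To make the cost model concrete, here is a minimal Python sketch of the splitting step. The helpers `partition_of` and `bucket_of` are hypothetical stand-ins for the engine's `PartitionBy`/`ClusterBy` evaluation, not real APIs:

```python
from collections import defaultdict

def split_batch(rows, partition_of, bucket_of):
    """Group one read batch the way the direct write path splits it:
    first by partition, then by Bucket. Each group becomes one new Part."""
    groups = defaultdict(list)
    for row in rows:
        # rows sharing the same (partition, Bucket) pair land in the same Part
        groups[(partition_of(row), bucket_of(row))].append(row)
    return groups
```

Each distinct (partition, Bucket) pair in a batch yields one Part, so a single batch that touches 3 partitions with 4 Buckets each can create up to 12 Parts.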
Tuning suggestions:
To reduce the number of Parts, arrange rows that share the same partition and Bucket contiguously in the file. Each batch of newly read data then spans as few (partition, Bucket) groups as possible and generates correspondingly few Parts. In practice, sort the data by partition first, and then by Bucket within each partition.
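
A minimal sketch of that pre-sort, reusing the hypothetical `partition_of`/`bucket_of` helpers from the sketch above:

```python
def presort_rows(rows, partition_of, bucket_of):
    """Sort rows so that all rows sharing a partition and Bucket are contiguous;
    each fixed-size read batch then spans as few (partition, Bucket) groups
    as possible, minimizing the Parts it produces."""
    return sorted(rows, key=lambda row: (partition_of(row), bucket_of(row)))
```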
The calculation rules for Buckets are:
- If `SPLIT_NUMBER` is not specified, the `BucketNumber` is obtained by computing the SipHash of the columns used by the `ClusterByKey` and taking it modulo the number of Buckets.
- If `SPLIT_NUMBER` is specified:
  - Calculate the `SplitValue`:
    - If `ClusterBy` is a single column, use the `dtspartition` function to calculate the corresponding `SplitValue`.
    - If `ClusterBy` is multiple columns, use SipHash over these columns to calculate the corresponding `SplitValue`.
  - Calculate the `BucketNumber`:
    - If it is `WithRange`, use `SplitValue * BucketCount / SplitNumber` to calculate the corresponding `BucketNumber`.
    - If it is not `WithRange`, use `SplitValue % BucketCount` to calculate the corresponding `BucketNumber`.
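
A minimal Python sketch of these rules follows. Here `siphash` and `dtspartition` are local stand-ins for the engine's built-ins (their real signatures, and the exact reduction of a multi-column hash into the split space, are assumptions):

```python
import hashlib

def siphash(*values) -> int:
    # stand-in for the engine's SipHash built-in; only the interface is illustrative
    digest = hashlib.sha256(repr(values).encode()).digest()
    return int.from_bytes(digest[:8], 'little')

def dtspartition(value, split_number) -> int:
    # stand-in for the engine's dtspartition function; the real mapping is engine-defined
    return siphash(value) % split_number

def bucket_number(row, cluster_by_cols, bucket_count,
                  split_number=None, with_range=False):
    """Illustrative rendering of the Bucket calculation rules above."""
    key = tuple(row[c] for c in cluster_by_cols)
    if split_number is None:
        # no SPLIT_NUMBER: SipHash of the ClusterBy key columns, modulo the Bucket count
        return siphash(*key) % bucket_count
    if len(cluster_by_cols) == 1:
        # single ClusterBy column: dtspartition maps the value into the split space
        split_value = dtspartition(key[0], split_number)
    else:
        # multiple ClusterBy columns: SipHash yields the SplitValue
        # (reduction into the split space is an assumption of this sketch)
        split_value = siphash(*key) % split_number
    if with_range:
        # WithRange: scale the SplitValue into the Bucket range (integer division)
        return split_value * bucket_count // split_number
    return split_value % bucket_count
```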
- When reading files:
  - If the size of each row of data is small, consider increasing `max_insert_block_size` so that larger blocks of data are read at a time, resulting in larger Parts.
  - If the file being read is not from HDFS/CFS and multiple files are matched using wildcards, adjust both `min_insert_block_size_rows` and `min_insert_block_size_bytes` accordingly.
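
As an illustration of how these settings interact, the sketch below derives a consistent set of values from an estimated row size. The numbers are placeholders, not recommendations:

```python
# illustrative values only; measure avg_row_bytes from the actual source file
avg_row_bytes = 100
target_rows = 1_048_576  # rows per block to aim for

settings = {
    # read up to this many rows per block, producing fewer, larger Parts
    'max_insert_block_size': target_rows,
    # keep the row and byte thresholds consistent so neither flushes blocks early
    'min_insert_block_size_rows': target_rows,
    'min_insert_block_size_bytes': target_rows * avg_row_bytes,
}
# pass `settings` to the client session that runs the INSERT
```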
Dumper Method Tuning
When using Dumper for data migration, a table and some of its partitions can be specified at once and written to HDFS. The time this takes depends on the table's data size and its number of Parts. To optimize this process:
- The `parallel` parameter can be used to increase the concurrency of uploading Parts.
- You can wait for the Parts to merge locally before using Dumper to upload them.
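
One way to approach the second suggestion is to trigger and then verify local merges before launching Dumper. Below is a hedged sketch using ClickHouse-style SQL through the `clickhouse-driver` Python client; whether the target deployment exposes the same `OPTIMIZE` and `system.parts` semantics is an assumption:

```python
from clickhouse_driver import Client

client = Client('localhost')

# ask the server to merge small Parts now (ClickHouse-style OPTIMIZE semantics)
client.execute('OPTIMIZE TABLE my_db.my_table FINAL')

# fewer active Parts means fewer uploads for Dumper to perform
[(active_parts,)] = client.execute(
    "SELECT count() FROM system.parts "
    "WHERE database = 'my_db' AND table = 'my_table' AND active"
)
print(f'active parts before dump: {active_parts}')
```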