Import Tuning
Direct Write Method Tuning
When using `INSERT VALUES`, `INSERT INFILE`, or the `PartWriter` tool for writing, the number of generated Parts directly affects the number of writes to HDFS, which in turn affects the overall write time. To optimize this process, the number of Parts should be minimized.
The direct write process is as follows:
- Read a portion of the file data.
- Split this data based on `PartitionBy`.
- Further split the data based on `ClusterBy`.
- Write the split data as new Parts and store them in HDFS.
Tuning suggestions:
To reduce the number of Parts, arrange rows with the same partition and the same Bucket together in the file. That way, each batch of newly read data generates as few new Parts as possible. Concretely, sort the data by partition, and within each partition by Bucket, as in the sketch below.
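A minimal sketch of such pre-sorting, assuming the source data is first staged in a table; the table names, the partition column `event_date`, the cluster-by column `user_id`, and the Bucket count of 8 are all hypothetical:

```sql
-- Hypothetical names throughout; adapt to the real schema.
-- Ordering by partition value, then by the Bucket expression, groups rows so
-- that each block read during import produces as few new Parts as possible.
INSERT INTO target_table
SELECT *
FROM staging_table
ORDER BY event_date, sipHash64(user_id) % 8;
```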
The calculation rules for Buckets are:

- If `SPLIT_NUMBER` is not specified, the `BucketNumber` is obtained by calculating the SipHash of the columns used by the `ClusterByKey` and then taking the modulus with the total number of Buckets.
- If `SPLIT_NUMBER` is specified:
  - First calculate the `SplitValue`:
    - If `ClusterBy` is a single column, use the `dtspartition` function to calculate the corresponding `SplitValue`.
    - If `ClusterBy` is multiple columns, use `SipHash` to calculate the corresponding `SplitValue` for these columns.
  - Then calculate the `BucketNumber`:
    - If it is `WithRange`, use `SplitValue * BucketCount / SplitNumber` to calculate the corresponding `BucketNumber`.
    - If it is not `WithRange`, use `SplitValue % BucketCount` to calculate the corresponding `BucketNumber`.
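As a rough illustration of the non-`WithRange` rule with a multi-column `ClusterBy` key, the Bucket a row would land in can be previewed directly in SQL. The column names and the `BucketCount` of 8 below are hypothetical:

```sql
-- SplitValue from SipHash over the ClusterBy columns (multi-column case),
-- then BucketNumber = SplitValue % BucketCount (non-WithRange case).
SELECT
    sipHash64(user_id, event_type)     AS split_value,
    sipHash64(user_id, event_type) % 8 AS bucket_number
FROM staging_table
LIMIT 10;
```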
- When reading files:
  - If each row of data is small, consider increasing `max_insert_block_size` to read larger blocks of data at a time, resulting in larger Parts.
  - If the files being read are not on HDFS/CFS and multiple files are matched using wildcards, also adjust `min_insert_block_size_rows` and `min_insert_block_size_bytes` accordingly; see the sketch after this list.
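A minimal sketch of applying these settings on an import, assuming a wildcard match over local CSV files via the standard `file` table function (availability may vary by deployment); the schema and the block-size values are placeholders, not recommendations:

```sql
-- Larger block thresholds let more rows accumulate per block,
-- producing fewer, larger Parts.
INSERT INTO target_table
SELECT *
FROM file('data/events_*.csv', 'CSV', 'user_id UInt64, event_type String')
SETTINGS
    max_insert_block_size = 1048576,
    min_insert_block_size_rows = 1048576,
    min_insert_block_size_bytes = 268435456;
```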
Dumper Method Tuning
When using Dumper for data migration, a table and some of its partitions can be specified at once and written to HDFS. The time this takes depends on the table's data size and its number of Parts. To optimize this process:
- The `parallel` parameter can be used to increase the concurrency of uploading Parts.
- You can wait for the Parts to merge locally before using Dumper to upload; the sketch below shows one way to check merge progress.
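A hedged way to gauge merge progress before running Dumper, assuming the standard `system.parts` table is exposed; the database and table names are hypothetical:

```sql
-- Fewer active parts per partition means fewer objects for Dumper to upload.
SELECT partition, count() AS active_parts
FROM system.parts
WHERE database = 'my_db' AND table = 'my_table' AND active
GROUP BY partition
ORDER BY active_parts DESC;

-- Optionally trigger merging first (can be expensive on large tables,
-- and behavior may differ by deployment):
OPTIMIZE TABLE my_db.my_table;
```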