Version: 0.3.x

Import Tuning

Direct Write Method Tuning

When writing data with INSERT VALUES, INSERT INFILE, or the PartWriter tool, the number of Parts generated directly determines the number of writes to HDFS and, in turn, the overall write time. To optimize this process, minimize the number of generated Parts.

The direct write process is as follows:

  1. Read a portion of the file data.
  2. Split this data based on PartitionBy.
  3. Further split the data based on ClusterBy.
  4. Write the split data as new Parts and store them in HDFS.
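
As a point of reference, the sketch below shows a partitioned, bucketed table and a direct write into it. The table name, columns, bucket count, and the CnchMergeTree/CLUSTER BY DDL are illustrative assumptions rather than syntax taken from this page, so check the CREATE TABLE reference for your version.

```sql
-- Illustrative table layout only; names and exact DDL syntax are assumptions.
CREATE TABLE example_events
(
    event_date Date,
    user_id    UInt64,
    payload    String
)
ENGINE = CnchMergeTree
PARTITION BY event_date
CLUSTER BY user_id INTO 8 BUCKETS
ORDER BY user_id;

-- A direct write: each block of input data is split by partition (event_date)
-- and then by Bucket (derived from user_id), and every (partition, Bucket)
-- combination touched by the block becomes a new Part on HDFS.
INSERT INTO example_events VALUES
    ('2024-01-01', 1, 'a'),
    ('2024-01-02', 2, 'b');
```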

Tuning suggestions:

  1. To reduce the number of Parts, arrange rows that belong to the same partition and Bucket next to each other in the input file, so that each block of data read produces as few new Parts as possible. In practice, sort the data first by the partition key and then by the Bucket within each partition (see the first sketch after this list).

  2. The calculation rules for Buckets are as follows (a schematic query appears after this list):

  • If SPLIT_NUMBER is not specified, the BucketNumber is obtained by computing the SipHash of the ClusterByKey columns and taking the result modulo the total number of Buckets (BucketCount).
  • If SPLIT_NUMBER is specified:
    • Calculate SplitValue:
      • If ClusterBy is a single column, use the dtspartition function to calculate the corresponding SplitValue.
      • If ClusterBy is multiple columns, use SipHash to calculate the corresponding SplitValue for these columns.
    • Calculate BucketNumber:
      • If it is WithRange, use SplitValue * BucketCount / SplitNumber to calculate the corresponding BucketNumber.
      • If it is not WithRange, use SplitValue % BucketCount to calculate the corresponding BucketNumber.
  3. When reading files (example settings appear after this list):
  • If the size of each row of data is small, consider increasing max_insert_block_size to read larger blocks of data at a time, resulting in larger Parts.
  • If the input files are not on HDFS/CFS and multiple files are matched by a wildcard, also adjust (typically increase) min_insert_block_size_rows and min_insert_block_size_bytes, so that rows read from many small files are combined into larger blocks.
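
As a rough sketch of suggestion 1, the query below orders rows by the partition key and then by a Bucket expression derived from the ClusterBy key, so that rows belonging to the same partition and Bucket arrive together. The source table, the bucket count of 8, and the use of sipHash64 to approximate the Bucket are assumptions made for this example; the same ordering can be applied when the input file is produced outside the database.

```sql
-- Hypothetical pre-ordering of input rows; names and the Bucket expression
-- are illustrative, not the engine's internal calculation.
INSERT INTO example_events
SELECT event_date, user_id, payload
FROM staging_events
ORDER BY
    event_date,              -- partition key: keep each partition contiguous
    sipHash64(user_id) % 8;  -- approximate Bucket within the partition (8 Buckets assumed)
```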
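
The arithmetic behind suggestion 2 can be sketched as a query. Here sipHash64(user_id) % split_number only stands in for the SplitValue that the engine derives via dtspartition or SipHash, and the bucket count and SPLIT_NUMBER values are made-up examples.

```sql
-- Schematic of the Bucket rules, not the engine's internal implementation.
WITH
    8  AS bucket_count,   -- CLUSTER BY ... INTO 8 BUCKETS
    32 AS split_number    -- SPLIT_NUMBER 32 (example value)
SELECT
    -- No SPLIT_NUMBER: SipHash of the ClusterBy key columns modulo BucketCount.
    sipHash64(user_id) % bucket_count                                        AS bucket_no_split_number,
    -- SPLIT_NUMBER with WithRange: SplitValue * BucketCount / SplitNumber.
    intDiv((sipHash64(user_id) % split_number) * bucket_count, split_number) AS bucket_with_range,
    -- SPLIT_NUMBER without WithRange: SplitValue % BucketCount.
    (sipHash64(user_id) % split_number) % bucket_count                       AS bucket_without_range
FROM example_events;
```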
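
For suggestion 3, the block-size settings can be raised at the session level before running the import; the values below are arbitrary examples, not recommendations.

```sql
-- Example values only; tune them to your row size and available memory.
SET max_insert_block_size       = 4194304;     -- form larger blocks when individual rows are small
SET min_insert_block_size_rows  = 4194304;     -- squash blocks until at least this many rows ...
SET min_insert_block_size_bytes = 1073741824;  -- ... or this many bytes before a Part is formed
```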

Dumper Method Tuning

When using Dumper for data migration, a table and a subset of its partitions can be specified at once and written to HDFS. The time this process takes depends on the table's data size and its number of Parts. To optimize this process:

  • The parallel parameter can be used to increase the concurrency of uploading Parts.
  • Alternatively, wait for the Parts to be merged locally before uploading with Dumper, so that fewer, larger Parts need to be transferred.