Skip to main content
Version: 0.3.x

External Tables

CnchHive

CnchHive is a table engine developed by ByConity that supports joint queries on external tables, allowing users to directly accelerate data queries without the need for data import. CnchHive supports querying data stored in HDFS and S3. It currently supports querying Hive data files stored in Parquet and ORC formats.

Table Creation Example

CREATE TABLE [IF NOT EXISTS] [db.]hive_ext
ENGINE = CnchHive('thrift://127.0.0.1:9083', 'hive_db', 'hive_table')
[SETTING vw_default = '']

Table information can be automatically imported from Hive metadata or manually specified with specific column names and their types.

Table Parameter Settings

CnchHive allows tables to be configured by specifying values in a configuration file, as follows:

<yandex>
<hive>
<vw_default>vw_default</vw_default>
</hive>
</yandex>
  • vw_default: The virtual table used to execute queries.
  • hdfs_fs: The URL of the HDFS namenode.
  • endpoint: The endpoint for S3.
  • region: The region for S3.
  • ak_id: The S3 access key ID.
  • ak_secret: The S3 secret access key.

Example Configuration for HDFS Hive Table

<hive>
<hdfs_fs>hdfs://127.0.0.1:9010</hdfs_fs>
</hive>

Example Configuration for S3 Hive Table

<hive>
<endpoint>http://127.0.0.1:9000</endpoint>
<ak_id>id</ak_id>
<ak_secret>secret</ak_secret>
</hive>

Disk Cache Configuration

Add the following configuration to worker.xml to enable caching:

<yandex>
<disk_cache_strategies>
<Hive>
<disk_cache_dir>hive</disk_cache_dir>
</Hive>
</disk_cache_strategies>
</yandex>

Query Settings

Disk Cache Mode

Determines whether to use disk caching in SELECT queries. Possible values:

  • AUTO - Use disk caching when available.
  • SKIP_DISK_CACHE - Do not use disk caching.

If there is a cache miss, Hive files will be cached in the background.

Default value: AUTO

Using hive_metastore_filter

When enabled, CnchHive checks if the query contains partition key conditions that can be passed to the Hive metastore to limit the partitions returned. The Hive metastore will return a filtered list of partitions based on the conditions. The Hive metastore can only recognize simple conditions.

Example

-- Column d is the Hive partition key
-- Partitions: d=1/, d=2/, d=3/

SELECT * FROM hive_ext WHERE d = 1
-- The Hive metastore will only return the partition d=1/

Default value: true

Using hive_partition_filter

When enabled, CnchHive further filters the list of partitions returned by the Hive metastore.

Example

-- Column d is the Hive partition key, with partitions:
-- d=20220203/, d=20220204/, d=20220205/

SELECT * FROM hive_ext WHERE toDate(d) = '2022-02-03'

-- The Hive metastore will return all three partitions, but CnchHive will apply partition filtering.
-- Only the partition d=20220203/ will be read in the end.

Default value: true

Using hive_split_level_filter

Parquet and ORC files have index data for each row group/stripe. If enabled, CnchHive will filter these row groups/stripes.

Default value: false

Table Functions

SELECT count() FROM hive('thrift://127.0.0.1', 'hive_db', 'hive_table')

You can query Hive tables without explicitly creating them.

SELECT * FROM hiveMetadata('thrift://127.0.0.1', 'hive_db', 'hive_table')

Returns the storage description of a Hive table from the Hive metastore.