External Tables
CnchHive
CnchHive is a table engine developed by ByConity that supports joint queries on external tables, allowing users to directly accelerate data queries without the need for data import. CnchHive supports querying data stored in HDFS and S3. It currently supports querying Hive data files stored in Parquet and ORC formats.
Table Creation Example
CREATE TABLE [IF NOT EXISTS] [db.]hive_ext
ENGINE = CnchHive('thrift://127.0.0.1:9083', 'hive_db', 'hive_table')
[SETTING vw_default = '']
Table information can be automatically imported from Hive metadata or manually specified with specific column names and their types.
Table Parameter Settings
CnchHive allows tables to be configured by specifying values in a configuration file, as follows:
<yandex>
<hive>
<vw_default>vw_default</vw_default>
</hive>
</yandex>
- vw_default: The virtual table used to execute queries.
- hdfs_fs: The URL of the HDFS namenode.
- endpoint: The endpoint for S3.
- region: The region for S3.
- ak_id: The S3 access key ID.
- ak_secret: The S3 secret access key.
Example Configuration for HDFS Hive Table
<hive>
<hdfs_fs>hdfs://127.0.0.1:9010</hdfs_fs>
</hive>
Example Configuration for S3 Hive Table
<hive>
<endpoint>http://127.0.0.1:9000</endpoint>
<ak_id>id</ak_id>
<ak_secret>secret</ak_secret>
</hive>
Disk Cache Configuration
Add the following configuration to worker.xml to enable caching:
<yandex>
<disk_cache_strategies>
<Hive>
<disk_cache_dir>hive</disk_cache_dir>
</Hive>
</disk_cache_strategies>
</yandex>
Query Settings
Disk Cache Mode
Determines whether to use disk caching in SELECT queries. Possible values:
- AUTO - Use disk caching when available.
- SKIP_DISK_CACHE - Do not use disk caching.
If there is a cache miss, Hive files will be cached in the background.
Default value: AUTO
Using hive_metastore_filter
When enabled, CnchHive checks if the query contains partition key conditions that can be passed to the Hive metastore to limit the partitions returned. The Hive metastore will return a filtered list of partitions based on the conditions. The Hive metastore can only recognize simple conditions.
Example
-- Column d is the Hive partition key
-- Partitions: d=1/, d=2/, d=3/
SELECT * FROM hive_ext WHERE d = 1
-- The Hive metastore will only return the partition d=1/
Default value: true
Using hive_partition_filter
When enabled, CnchHive further filters the list of partitions returned by the Hive metastore.
Example
-- Column d is the Hive partition key, with partitions:
-- d=20220203/, d=20220204/, d=20220205/
SELECT * FROM hive_ext WHERE toDate(d) = '2022-02-03'
-- The Hive metastore will return all three partitions, but CnchHive will apply partition filtering.
-- Only the partition d=20220203/ will be read in the end.
Default value: true
Using hive_split_level_filter
Parquet and ORC files have index data for each row group/stripe. If enabled, CnchHive will filter these row groups/stripes.
Default value: false
Table Functions
SELECT count() FROM hive('thrift://127.0.0.1', 'hive_db', 'hive_table')
You can query Hive tables without explicitly creating them.
SELECT * FROM hiveMetadata('thrift://127.0.0.1', 'hive_db', 'hive_table')
Returns the storage description of a Hive table from the Hive metastore.