Skip to main content
Version: 0.3.x

Hive External Catalog

In addition to supporting access to Hive data using external tables, Byconity also supports access through the External Catalog feature.

Configuring external_catalog_mgr

We need to store the metadata of the External Catalog in Byconity's metadata store. Therefore, we need to configure external_catalog_mgr in cnch_config.yaml to specify the location where External Catalog-related metadata will be stored.

Typically, users can choose to store External Catalog metadata in the same location as Byconity metadata. Therefore, users can directly reuse the content of the catalog_service configuration previously used for metadata, such as:

external_catalog_mgr:
type: fdb
fdb:
cluster_file: /config/fdb.cluster

Creating an External Catalog

Currently, only IAM authentication is supported.

Creating a Hive Catalog

For data stored in S3:

create external catalog hive_s3
properties
type='hive',
hive.metastore.uri = 'thrift://hive_thrift_server_ip:port',
aws.s3.region= 's3_region',
aws.s3.endpoint = 's3_endpoint',
aws.s3.access_key = 's3_ak',
aws.s3.secret_key = 's3_sk'

For data stored in HDFS:

create external catalog hive_hdfs
properties
type='hive',
hive.metastore.uri = 'thrift://hive_thrift_server_ip:port'

Note: For Hive external tables stored in HDFS, we only support reading data configured in cnch-config.yaml.

Creating a Glue Catalog

We also provide experimental support for the AWS Glue Data Catalog.

create external catalog glue_s3
properties
type='glue',
aws.glue.endpoint = 'glue_endpoint',
aws.glue.region='glue_region',
aws.glue.catalog_id='glue_catalog_id',
aws.glue.access_key = 'glue_ak',
aws.glue.secret_key = 'glue_sk',
aws.s3.region= 's3_region',
aws.s3.endpoint = 's3_endpoint',
aws.s3.access_key = 's3_ak',
aws.s3.secret_key = 's3_sk'

Here, glue_catalog_id is a 12-digit AWS account ID. For more information, refer to Aws Account ID Doc.

Dropping an External Catalog

Users can drop an External Catalog as follows:

drop external catalog your_catalog_name;

Basic Usage

Assuming a user has already created an External Catalog called hive_s3.

Three-part Naming Convention

Users can directly access tables in Hive using the three-part naming convention: catalog_name.db_name.table_name. For example:

select * from hive_s3.hive_db_name.hive_table_name;

Native CnchMergeTree tables in Byconity can also be accessed using SQL like:

select * from cnch.cnch_db_name.cnch_table_name;
-- this is equivalent to select * from cnch_db_name.cnch_table_name;

cnch (short for cloud-native-clickhouse) is used as the default Catalog name in Byconity.

Cross-Catalog Queries

With the External Catalog, users can directly join Hive external tables with CnchMergeTree tables in Cnch:

select * from hive_s3.hive_db.hive_table union all select (1) from cnch.cnch_db.cnch_table;

Showing Databases and Tables

Listing database names in a Catalog:

show databases [from hive_catalog]

Listing table names in a database:

show tables from [hive_catalog.][database]

Getting the create statement for a table:

show create table [hive_catalog.][database.][table]

Please note that the show create table result for an external table will look similar to this:

CREATE TABLE `hive_catalog$$hive_db_name`.hive_table_name UUID 'some-uuid' (--field list-- `cc_call_center_sk` Nullable(Int64), `cc_call_center_id` Nullable(String))) ENGINE = CnchHive(hive_catalog, hive_db_name, hive_table_name) PARTITION BY tuple() SETTINGS endpoint = 'hive_endpoint', ak_id = 's3_ak', ak_secret = 's3_sk', region = 's3_region'

This is only used to display schema information for the external table and cannot be used to create an external table in Hive.

Currently, Byconity does not support directly creating Hive tables.

Switching Catalogs

Users can use the following SQL to change the default Catalog:

switch catalog hive_s3;

At this point, running:

select * from tpcds.call_center;

Byconity will read data from the call_center table in the tpcds database in Hive.

To switch back to the default Catalog, users can use:

switch catalog cnch;

Users can also use:

use hive_s3.tpcds;

to directly change the default database to tpcds in Hive.

To switch back, simply use:

use cnch.cnch_database_name;

Implementation Details

We use a SQL rewriting approach to support multiple Catalogs. For example:

select * from catalog_name.db.tbl;

is actually rewritten to:

select * from `catalog_name$$db`.tbl;

So users should not be surprised to see this in the logs.