Detailed Explanation of ByConity ELT Principles

September 10, 2023 · 10 min read

ByConity maintainer

Background

Data warehouses inevitably rely on Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) to bring data from different sources and in various formats into the warehouse for processing. Traditionally, data transformation uses ETL to convert business data into a data model suitable for the warehouse. However, this relies on an ETL system that is independent of the data warehouse, which results in high maintenance costs. As a cloud-native data warehouse, ByConity has gradually added support for ELT since version 0.2.0, freeing users from maintaining multiple heterogeneous data systems. This article introduces ByConity's ELT capabilities, implementation principles, and usage.

ETL Scenarios and Solutions

Differences between ELT and ETL

ETL: Describes the process of extracting data from a source, transforming it, and loading it into a destination (data warehouse). The Transform phase typically describes the preprocessing applied to the data before it is loaded into the data warehouse.

ELT: Focuses on loading minimally processed data into the data warehouse, leaving most of the transformation operations to the analysis phase. Compared to ETL, it requires less data modeling and gives analysts more flexibility. ELT has become the norm in big data processing today, and it places many new requirements on data warehouses.

Challenges of Resource Duplication

A typical data pipeline is as follows: we ingest behavioral data, logs, clickstreams, etc., into a storage system using MQ/Kafka/Flink. The storage system can be either on-premises HDFS or cloud-based remote storage such as OSS and S3. Then, a series of ETL operations are performed in the data warehouse to provide data to OLAP systems for analysis and query. However, some businesses need to branch off from this storage, exporting data from the main pipeline at a certain stage of analysis to perform ETL operations that differ from those of the main pipeline. This results in duplicate data storage, and two different sets of ETL logic may also emerge in the process.

As the amount of data increases, the cost pressure brought by computational and storage redundancy also increases. Meanwhile, the expansion of storage space makes elastic scaling inconvenient.

Industry Solutions

In the industry, there are several approaches to address the above issues:

Data pre-calculation school: Tools like Kylin. If report generation in Hadoop systems is slow or aggregation capabilities are poor, cubes or views can be calculated in advance for the configured indicators. At query time, the SQL is answered directly from the precomputed cubes or views.
Streaming and batch integration school: Tools like Flink and RisingWave. Data is aggregated directly in memory for reports or large screens as it flows in. After aggregation, the results are written to HBase or MySQL for retrieval and display. Flink also exposes interfaces for intermediate states, i.e., queryable state, to enable users to better utilize state data. However, the final results still need to be reconciled with batch computation results, and if inconsistencies are found, backtracking operations may be required. The entire process places high demands on the skills of operation and maintenance/development teams.
Data lake and warehouse integration & HxxP: Combining data lakes with data warehouses.

ELT in ByConity

Overall Execution Flow

System Requirements for ELT Tasks:

Overall scalability: Importing and transforming often require significant resources, and the system needs to scale horizontally to meet rapid data growth.
Reliability and fault tolerance: Large numbers of jobs can be scheduled in an orderly manner; retries can be triggered on occasional task failures (OOM), container failures, etc.; a certain degree of data skew can be handled.
Efficiency and performance: Effective utilization of multi-core and multi-machine concurrency; fast data import; efficient memory usage (memory management); CPU optimization (vectorization, codegen).
Ecosystem and observability: Compatible with various tools; task status awareness; task progress awareness; failed log query; certain visualization capabilities.

Based on the requirements of ELT tasks and the difficulties encountered in current scenarios, ByConity has added the following features and optimizations.

Stage-level Scheduling

Principle Analysis

The current SQL execution process in ClickHouse is as follows:
- In the first stage, the Coordinator receives a distributed table query and converts it into a local table query for each shard node.
- In the second stage, the Coordinator aggregates the results from each node and returns them to the client.
ClickHouse converts the right table in Join operations into a subquery, which brings several issues that are difficult to resolve:
- Complex queries have multiple subqueries, resulting in high conversion complexity.
- When the Join table is large, it can easily cause OOM in worker nodes.
- Aggregation occurs at the Coordinator, putting pressure on it and easily becoming a performance bottleneck.

Unlike ClickHouse, we have implemented optimization for the execution of complex queries in ByConity. By splitting the execution plan, we transform the previous two-stage execution model into stage-level execution. During the logical plan phase, exchange operators are inserted based on operator types. During the execution phase, the entire execution plan is DAG-split based on exchange operators, and scheduling is performed stage by stage. The exchange operators between stages are responsible for data transmission and exchange. Key nodes:

Insertion of exchange nodes
Splitting of stages
Stage scheduler
Segment executor
Exchange manager
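
The splitting step described above can be illustrated with a small sketch. It is not ByConity's implementation: the `PlanNode` class, the `split_into_stages` and `schedule_order` functions, and the example plan are hypothetical names chosen for illustration. The sketch assumes a plan tree in which exchange nodes have already been inserted, cuts the tree at each exchange node, and then orders the resulting stages so that each stage comes after the stages it reads from.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PlanNode:
    """A node in a simplified query plan tree (hypothetical model)."""
    name: str
    children: list = field(default_factory=list)
    is_exchange: bool = False


def split_into_stages(root):
    """Cut the plan at every exchange node.

    Returns {stage_id: (operator_names, upstream_stage_ids)}.
    """
    stages = {}
    ids = count()

    def build(stage_root):
        stage_id = next(ids)
        operators, upstream = [], []
        stages[stage_id] = (operators, upstream)
        stack = [stage_root]
        while stack:
            node = stack.pop()
            operators.append(node.name)
            for child in node.children:
                if child.is_exchange:
                    # An exchange marks a stage boundary: everything below it
                    # becomes a separate stage that feeds this one.
                    for grandchild in child.children:
                        upstream.append(build(grandchild))
                else:
                    stack.append(child)
        return stage_id

    build(root)
    return stages


def schedule_order(stages):
    """Order stage ids so every stage follows the stages it depends on."""
    order, seen = [], set()

    def visit(stage_id):
        if stage_id in seen:
            return
        seen.add(stage_id)
        for dependency in stages[stage_id][1]:
            visit(dependency)
        order.append(stage_id)

    for stage_id in stages:
        visit(stage_id)
    return order


# Example plan: FinalAgg <- Exchange <- Join(ScanA, Exchange <- ScanB)
plan = PlanNode("FinalAgg", [
    PlanNode("Exchange", is_exchange=True, children=[
        PlanNode("Join", [
            PlanNode("ScanA"),
            PlanNode("Exchange", is_exchange=True, children=[PlanNode("ScanB")]),
        ]),
    ]),
])
stages = split_into_stages(plan)
order = schedule_order(stages)
```

In this example the plan is cut into three stages, and the stage that scans the right side of the join is ordered first. A real scheduler may launch stages in a different way; the sketch only shows how exchange nodes define the stage boundaries and the dependencies between stages.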

Here, we focus on the exchange perspective. As you can see in the figure above, at the top level is the query plan. When converting it to a physical plan, we transform it into different operators based on different data distribution requirements. The source layer, which receives data, is mostly unified and called ExchangeSource. Sinks have different implementations, such as BroadcastSink, Local, PartitionSink, etc., which run as part of map tasks. For cross-node data operations, we use a unified brpc streaming data transmission at the bottom level, and for local operations, we use memory queues. We have made very detailed optimizations for different points:

Data transmission layer
- In-process communication uses memory queues, without serialization, zero copy
- Inter-process communication uses brpc stream RPC, ensuring order preservation, connection reuse, status code transmission, compression, etc.
Operator layer
- Batch sending
- Thread reuse, reducing the number of threads
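
To make the sink and source roles more concrete, here is a minimal in-process sketch. It only borrows the names `PartitionSink`, `BroadcastSink`, and the idea of an exchange source from the text; the class bodies, the `drain` helper, the CRC32-based partitioning, and the `None` end-of-stream marker are assumptions made for illustration, and the brpc path for cross-node transfer is not modeled. In-memory queues pass object references, so rows are handed over without serialization.

```python
import queue
import zlib


class PartitionSink:
    """Routes each row to one downstream queue, chosen by hashing a key column."""

    def __init__(self, outputs, key_index, batch_size=2):
        self.outputs = outputs
        self.key_index = key_index
        self.batch_size = batch_size
        self.buffers = [[] for _ in outputs]

    def write(self, row):
        key = str(row[self.key_index]).encode()
        target = zlib.crc32(key) % len(self.outputs)
        self.buffers[target].append(row)
        if len(self.buffers[target]) >= self.batch_size:
            self._flush(target)

    def _flush(self, target):
        if self.buffers[target]:
            # Batch sending: one queue item carries several rows.
            self.outputs[target].put(self.buffers[target])
            self.buffers[target] = []

    def finish(self):
        for target in range(len(self.outputs)):
            self._flush(target)
            self.outputs[target].put(None)  # end-of-stream marker (assumption)


class BroadcastSink:
    """Sends every row to all downstream queues."""

    def __init__(self, outputs):
        self.outputs = outputs

    def write(self, row):
        for out in self.outputs:
            out.put([row])

    def finish(self):
        for out in self.outputs:
            out.put(None)


def drain(source):
    """Plays the role of an exchange source: read batches until end-of-stream."""
    rows = []
    while True:
        batch = source.get()
        if batch is None:
            return rows
        rows.extend(batch)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
partition_queues = [queue.Queue(), queue.Queue()]
partition_sink = PartitionSink(partition_queues, key_index=0)
for row in rows:
    partition_sink.write(row)
partition_sink.finish()
partitioned = [drain(q) for q in partition_queues]

broadcast_queues = [queue.Queue(), queue.Queue()]
broadcast_sink = BroadcastSink(broadcast_queues)
broadcast_sink.write(("x", 9))
broadcast_sink.finish()
broadcasted = [drain(q) for q in broadcast_queues]
```

With the partition sink, rows that share a key always land in the same downstream queue, which is what a distributed join or aggregation needs. With the broadcast sink, every consumer receives the full input.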

Benefits

Because ByConity fully adopts a multi-stage query execution approach, there are significant overall benefits:

More stable and efficient Coordinator
- Aggregation and other operators are split to worker nodes for execution
- The Coordinator node only needs to aggregate the final results
Reduced Worker OOM
- Stage splitting makes each stage's computation relatively simple
- The addition of exchange operators reduces memory pressure
More stable and efficient network connections
- Effective transmission by exchange operators
- Reuse of connection pools

Adaptive Scheduler

The Adaptive Scheduler is a feature we have implemented for stability. In OLAP scenarios, some data may turn out to be incomplete or queries may time out, often because each worker is shared by all queries. Once a worker is slow, it can affect the execution of the entire query.

Issues with shared computation nodes:

The load on the node where Scan occurs is related to the amount of scan data required by different queries, and it cannot be perfectly averaged.
The resource requirements vary greatly among Plan Segments.

This leads to severe load imbalance among worker nodes. Heavily loaded worker nodes can affect the overall progress of the query. Therefore, we have implemented the following optimization solutions:

Establishment of a Worker health mechanism. The server side establishes a Worker health management class that can quickly obtain health information about the Worker Group, including CPU, memory, the number of running queries, etc.
Adaptive scheduling: Each SQL dynamically selects and controls the concurrency of computation nodes based on Worker health.
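
As a rough illustration of health-based selection, the sketch below filters out unhealthy workers and picks the least loaded of the remaining ones. The `WorkerHealth` fields mirror the metrics named above (CPU, memory, number of running queries), but the class, the thresholds, and the ranking rule are assumptions for illustration and are not taken from ByConity's code.

```python
from dataclasses import dataclass


@dataclass
class WorkerHealth:
    worker_id: str
    cpu_usage: float       # fraction between 0.0 and 1.0
    memory_usage: float    # fraction between 0.0 and 1.0
    running_queries: int


def is_healthy(worker, max_cpu=0.8, max_memory=0.8, max_queries=10):
    """Hypothetical thresholds; a real system would make these configurable."""
    return (
        worker.cpu_usage <= max_cpu
        and worker.memory_usage <= max_memory
        and worker.running_queries <= max_queries
    )


def select_workers(workers, wanted_parallelism):
    """Pick up to `wanted_parallelism` healthy workers, least loaded first."""
    healthy = [w for w in workers if is_healthy(w)]
    healthy.sort(key=lambda w: (max(w.cpu_usage, w.memory_usage), w.running_queries))
    return [w.worker_id for w in healthy[:wanted_parallelism]]


workers = [
    WorkerHealth("w1", cpu_usage=0.95, memory_usage=0.50, running_queries=8),
    WorkerHealth("w2", cpu_usage=0.30, memory_usage=0.40, running_queries=2),
    WorkerHealth("w3", cpu_usage=0.60, memory_usage=0.20, running_queries=5),
    WorkerHealth("w4", cpu_usage=0.10, memory_usage=0.10, running_queries=1),
]
chosen = select_workers(workers, wanted_parallelism=3)
reduced = select_workers(workers, wanted_parallelism=5)
```

When fewer healthy workers exist than the query asks for, the query simply runs with lower concurrency (`reduced` contains three workers rather than five), which is one simple way to keep load away from a worker that is already struggling.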

Query Queue Mechanism

Our clusters may also become fully loaded, with all workers unhealthy or overloaded. In such cases, we use a query queue. We implemented a manager directly on the server side. Each time a query arrives, the manager checks the cluster's resources while holding a lock. If resources are insufficient, the query waits until resources are released and it is woken up. This prevents the server from issuing an unbounded number of computation tasks, which would overload and crash worker nodes. The current implementation is relatively simple. The server is multi-instance, and each server instance has its own queue, so each queue has only a local view and lacks a global view of resources. Additionally, the query status in each queue is not persisted but simply cached in memory. In the future, we will add coordination between servers to limit query concurrency from a global perspective. We will also persist the query status held in each queue.
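
The wait-and-wake behavior of such a queue can be sketched with a condition variable. This is a simplified model and not ByConity's manager: the `QueryQueue` class is hypothetical, and "resources" are reduced to a fixed number of running-query slots, whereas the text describes a check against the cluster's actual resources.

```python
import threading


class QueryQueue:
    """Admits at most `max_running` queries at a time (simplified resource model)."""

    def __init__(self, max_running):
        self._max_running = max_running
        self._running = 0
        self._cond = threading.Condition()

    def acquire(self, timeout=None):
        """Block until a slot is free; return False if the timeout expires first."""
        with self._cond:
            admitted = self._cond.wait_for(
                lambda: self._running < self._max_running, timeout
            )
            if admitted:
                self._running += 1
            return admitted

    def release(self):
        """Free a slot and wake one waiting query."""
        with self._cond:
            self._running -= 1
            self._cond.notify()


query_queue = QueryQueue(max_running=1)
first = query_queue.acquire()               # admitted immediately
second = query_queue.acquire(timeout=0.05)  # queue is full, gives up after waiting
query_queue.release()                       # the first query finishes
third = query_queue.acquire(timeout=0.05)   # a slot is free again
```

Like the implementation described above, this queue lives in the memory of a single server instance, so it has no global view across servers and loses its state on restart.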

Async Execution

A typical characteristic of ELT tasks is that their running time is relatively long compared to real-time analysis. Generally, ELT tasks can take minutes or even hours to execute. Currently, ClickHouse's client queries are returned in a blocking manner. This results in the client remaining in a waiting state for an extended period, during which it needs to maintain a connection with the server. In unstable network conditions, the connection between the client and server may be disconnected, leading to task failures on the server side. To reduce such unnecessary failures and reduce the complexity of maintaining connections for the client, we have developed an asynchronous execution feature, which is implemented as follows:

User-specified asynchronous execution. Users can specify asynchronous execution on a per-query basis by using settings enable_async_query = 1. Alternatively, they can set it at the session level using set enable_async_query = 1.
If the query is asynchronous, it is placed in a background thread pool for execution.
Silent I/O. When an asynchronous query is executing, its interaction with the client, such as log output, needs to be severed.

Initialization of the query still occurs in the session's synchronous thread. Once initialization is complete, the query state is written to the metastore, and an async query ID is returned to the client. The client can use this ID to query the status of the query. After the async query ID is returned, it indicates the completion of the interaction for this query. In this mode, if the statement is a SELECT, subsequent results cannot be sent back to the client. In such cases, we recommend users use a combination of async query and SELECT...INTO OUTFILE to meet their needs.
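
The flow described above (record the state, hand the work to a background pool, return an ID, let the client poll) can be sketched as follows. The `AsyncQueryRunner` class, the in-memory dictionary standing in for the metastore, and the status strings are hypothetical and chosen only for illustration; they are not ByConity's API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class AsyncQueryRunner:
    """Runs queries in a background pool and exposes their status by ID."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._status = {}  # stands in for the metastore (assumption)

    def submit(self, run_query):
        query_id = str(uuid.uuid4())
        # The state is recorded before the ID is handed back to the client.
        self._status[query_id] = "Running"

        def task():
            try:
                run_query()
                self._status[query_id] = "Finished"
            except Exception as exc:
                self._status[query_id] = f"Failed: {exc}"

        self._pool.submit(task)
        return query_id

    def status(self, query_id):
        return self._status.get(query_id, "Unknown")

    def wait_all(self):
        """Block until every submitted query has completed (used in the example)."""
        self._pool.shutdown(wait=True)


def failing_query():
    raise RuntimeError("boom")


runner = AsyncQueryRunner()
ok_id = runner.submit(lambda: None)
bad_id = runner.submit(failing_query)
runner.wait_all()
```

After `submit` returns, the caller holds only the ID, so a dropped client connection no longer affects the running query. This also shows why a SELECT cannot stream results back in this mode: nothing but the status is reachable through the ID.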

Future Plans

Regarding ELT mixed loads, the ByConity 0.2.0 version is just the beginning. In subsequent versions, we will continue to optimize query-related capabilities, with ELT as the core focus. The plans are as follows:

Fault Recovery Capabilities

Operator Spill
- Spill for Sort, Agg, and Join operators;
- Exchange Spill capability;
Recoverability
- Operator execution recovery: When ELT tasks run for a long time, occasional failures of intermediate tasks can lead to the failure of the entire query. Supporting task-level retries can significantly reduce occasional failures caused by environmental factors;
- Stage retries: When a node fails, stage-level retries can be performed;
- Ability to save queue job states;
Remote Shuffle Service: Currently, open-source shuffle services in the industry are often tailored for Spark and lack generic clients, such as C++ clients. We will supplement this capability in the future.

Resources

Specifiable computational resources: Users can specify the computational resources required for a query;
Computational resource estimation/reservation: Dynamically estimate the computational resources required for a query and allocate them through reservation;
Dynamic resource allocation: Currently, workers are permanent processes/nodes. Dynamic resource allocation can improve utilization;
Fine-grained resource isolation: Reduce the mutual influence between queries through worker group or process-level isolation;

Performance Comparison Analysis of ByConity and Mainstream Open-Source OLAP Engines (ClickHouse, Doris, Presto)

July 21, 2023 · 9 min read

Yunbo Wang

ByConity maintainer

Introduction

As the amount and complexity of data continue to increase, more and more companies are using OLAP (Online Analytical Processing) engines to process large-scale data and provide instant analysis results. Performance is a crucial factor when selecting an OLAP engine. Therefore, this article will compare the performance of four open-source OLAP engines: ClickHouse, Doris, Presto, and ByConity, using the 99 query statements from the TPC-DS benchmark test. The aim is to provide a reference for companies to choose a suitable OLAP engine.

Introduction to TPC-DS Benchmark Test

TPC-DS (Transaction Processing Performance Council Decision Support Benchmark) is a benchmark test designed for decision support systems (DSS). Developed by the TPC organization, it simulates multidimensional analysis and decision support scenarios, providing 99 query statements to evaluate the performance of database systems in complex multidimensional analysis scenarios. Each query is designed to simulate complex decision support scenarios, including joins across multiple tables, aggregations and groupings, subqueries, and other advanced SQL techniques.

Introduction to OLAP Engines

ClickHouse, Doris, Presto, and ByConity are currently popular open-source OLAP engines known for their high performance and scalability.

ClickHouse is a column-based database management system developed by Yandex, a Russian search engine company. It focuses on fast query and analysis of large-scale data.
Doris is a distributed column-based storage and analysis system that supports real-time query and analysis and can integrate with big data technologies such as Hadoop, Spark, and Flink.
Presto is a distributed SQL query engine developed by Facebook that enables fast query and analysis on large-scale datasets.
ByConity is a cloud-native data warehouse open-sourced by ByteDance. It adopts a storage-compute separation architecture, achieves tenant resource isolation, elastic scaling, and strong consistency in data read and write. It supports mainstream OLAP engine optimization techniques and exhibits excellent read and write performance.

This article will test the performance of these four OLAP engines using the 99 query statements from the TPC-DS benchmark test and compare their performance differences in different types of queries.

Test Environment and Methodology

Test Environment Configuration

Environment Configuration
- Memory: 256GB
- Disk: ATA, 7200rpm, partitioned:gpt
- System: Linux 4.14.81.bm.30-amd64 x86_64, Debian GNU/Linux 9
Test Data Volume
- Using 1TB of data tables, equivalent to 2.8 billion rows of data

Software Name	Version	Release Date	Number of Nodes	Other Configurations
ClickHouse	23.4.1.1943	2023-04-26	5 Workers	distributed_product_mode = 'global', partial_merge_join_optimizations = 1
Doris	1.2.4.1	2023-04-27	5 BEs, 1 FE	Bucket configuration: Dimension table 1, returns table 10-20, sales table 100-200
Presto	0.28.0	2023-03-16	5 Workers, 1 Coordinator	Hive Catalog, ORC format, Xmx200GB, enable_optimizer=1, dialect_type='ANSI'
ByConity	0.1.0-GA	2023-03-15	5 Workers	enable_optimizer=1, dialect_type='ANSI'

Server Configuration

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2494.435
CPU max MHz:           2900.0000
CPU min MHz:           1200.0000
BogoMIPS:              4389.83
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Test Methodology

Use the 99 query statements from the TPC-DS benchmark test and 1TB (2.8 billion rows) of data to test the performance of the four OLAP engines.
Use the same test dataset in each engine and maintain the same configuration and hardware environment.
Execute each query multiple times and take the average value to reduce measurement errors, with a query timeout set to 500 seconds.
Record details of query execution, such as query execution plans, I/O and CPU usage.
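
The averaging step can be written down as a short sketch. It assumes that a timed-out run is counted as the full 500-second timeout, which is how the results below treat timeouts. The function names are hypothetical, and the numbers in the example are made up for illustration; they are not measurements from this test.

```python
TIMEOUT_SECONDS = 500  # per-query timeout stated in the methodology


def average_time(run_times):
    """Average several runs; a run recorded as None timed out and counts as the timeout."""
    counted = [
        TIMEOUT_SECONDS if t is None else min(t, TIMEOUT_SECONDS)
        for t in run_times
    ]
    return sum(counted) / len(counted)


def total_time(results):
    """Sum of per-query averages, i.e. the total execution time for one engine."""
    return sum(average_time(runs) for runs in results.values())


# Illustrative values only, not results from the benchmark.
example_results = {
    "q1": [10.0, 12.0, 14.0],
    "q2": [None, None, None],  # timed out in all three runs
}
q1_average = average_time(example_results["q1"])
q2_average = average_time(example_results["q2"])
example_total = total_time(example_results)
```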

Performance Test Results

We used the same dataset and hardware environment to test the performance of these four OLAP engines. The test dataset size is 1TB, and the hardware and software environments are as described above. We conducted three consecutive tests on each of the four OLAP engines using the 99 query statements from the TPC-DS benchmark test and took the average of the three results.

Among them, ByConity successfully ran all 99 query tests. Doris crashed on SQL15 and had four timeouts, specifically SQL54, SQL67, SQL78, and SQL95. Presto had timeouts only on SQL67 and SQL72, while all other queries ran successfully. ClickHouse only ran 50% of the query statements, with some timing out and others reporting system errors. The analysis revealed that ClickHouse does not effectively support multi-table join queries, requiring manual rewriting and splitting of such SQL statements for execution. Therefore, we temporarily exclude ClickHouse from the comparison of total execution time.

The total execution time for the TPC-DS tests of the other three OLAP engines is shown in Figure 1. As seen in Figure 1, the query performance of the open-source ByConity is significantly better than the other engines, approximately 3-4 times faster. (Note: The vertical axis units of all charts below are in seconds.)

Figure 1: Total Execution Time for 99 Queries in TPC-DS

Based on the 99 query statements from the TPC-DS benchmark test, we will categorize them according to different query scenarios, such as basic queries, join queries, aggregation queries, subqueries, and window function queries. We will use these categories to analyze and compare the performance of ClickHouse, Doris, Presto, and ByConity:

Basic Query Scenario

This scenario involves simple query operations, such as retrieving data from a single table, filtering, and sorting results. The performance test of basic queries mainly focuses on the ability to process individual queries. Among them, ByConity performs best, with Presto and Doris also showing good performance. This is because basic queries usually involve only a small number of tables and fields, allowing Presto and Doris to fully utilize their distributed query features and in-memory computing capabilities. ClickHouse struggles with multi-table joins, resulting in some queries timing out. Specifically, SQL5, 8, 11, 13, 14, 17, and 18 all timed out. We calculated these timeouts as 500 seconds but truncated them to 350 seconds for clearer display. Figure 2 shows the average query time for the four engines in the basic query scenario:

Figure 2: Performance Comparison of Basic Queries in TPC-DS

Join Query Scenario

Join queries are common multi-table query scenarios that typically use JOIN statements to connect multiple tables and retrieve data based on specified conditions. As seen in Figure 3, ByConity performs best, mainly due to its optimized query optimizer, which introduces cost-based optimization capabilities (CBO) and performs optimization operations such as re-ordering during multi-table joins. Presto and Doris follow closely behind, while ClickHouse performs relatively poorly in multi-table joins and does not support many complex statements well.

Figure 3: Performance Comparison of Join Queries in TPC-DS

Aggregation Query Scenario

Aggregation queries involve statistical calculations on data, using aggregate functions such as SUM, AVG, and COUNT. ByConity continues to perform exceptionally well, followed by Doris and Presto. ClickHouse experienced four timeouts. To facilitate comparison, we truncated the timeout value to 250 seconds.

Figure 4: Performance Comparison of Aggregation Queries in TPC-DS

Subquery Scenario

Subqueries are nested within SQL statements and often serve as conditions or constraints for the main query. As shown in Figure 5, ByConity performs best due to its rule-based optimization (RBO) capabilities. ByConity optimizes complex nested queries holistically through techniques like operator pushdown, column pruning, and partition pruning, eliminating all subqueries and transforming common operators into Join+Agg formats. Doris and Presto also perform relatively well, but Presto experienced timeouts on SQL68 and SQL73, and Doris experienced timeouts on three SQL queries. ClickHouse also had some timeouts and system errors, as mentioned earlier. For easier comparison, we truncated the timeout value to 250 seconds.

Figure 5: Performance Comparison of Subqueries in TPC-DS

Window Function Query Scenario

Window function queries are advanced SQL query scenarios that enable ranking, grouping, sorting, and other operations within query results. As shown in Figure 6, ByConity exhibits the best performance, followed by Presto. Doris experienced a single timeout, and ClickHouse still had some TPC-DS tests that did not complete successfully.

Figure 6: Performance Comparison of Window Function Queries in TPC-DS

Conclusion

This article analyzes and compares the performance of four OLAP engines - ClickHouse, Doris, Presto, and ByConity - using the 99 query statements from the TPC-DS benchmark test. We found that the performance of the four engines varies under different query scenarios. ByConity performs exceptionally well in all 99 TPC-DS query scenarios, surpassing the other three OLAP engines. Presto and Doris perform relatively well in join queries, aggregation queries, and window function queries. However, ClickHouse's design and implementation are not specifically optimized for join queries, resulting in subpar performance in multi-table join scenarios.

It is important to note that performance test results depend on multiple factors, including data structure, query type, and data model. In practical applications, it is necessary to consider various factors comprehensively to select the most suitable OLAP engine.

When selecting an OLAP engine, other factors such as scalability, usability, and stability should also be considered. In practical applications, it is essential to select based on specific business needs and configure and optimize the engine reasonably to achieve optimal performance.

In summary, ClickHouse, Doris, Presto, and ByConity are all excellent OLAP engines with different strengths and applicable scenarios. To evaluate an engine's performance comprehensively, it is essential to select representative query scenarios and datasets and to test and analyze each query scenario separately.

Join Us

The ByConity community has a large number of users and is a very open community. We invite everyone to discuss and contribute together. We have opened an issue on GitHub: https://github.com/ByConity/ByConity/issues/26. You can also join our Feishu group, Slack, or Discord to participate in the discussion.

ByteDance Open Sources Its Cloud Native Data Warehouse ByConity

May 24, 2023 · 12 min read

Vini Jaiswal

ByConity maintainer

ByConity is an open-source cloud-native data warehouse developed by ByteDance. It is built on a computing-storage separation architecture and offers elastic scalability, tenant resource isolation, and strong consistency in data reading and writing. By adopting optimizations from popular OLAP engines, such as column storage, vectorized execution, MPP execution, and query optimization, ByConity delivers exceptional read and write performance.

History of ByConity

The origins of ByConity can be traced back to 2018 when ByteDance initially implemented ClickHouse for internal use. As the business grew, the data volume increased significantly to cater to a large user base. However, ClickHouse's Shared-Nothing architecture, where each node operates independently without sharing storage resources, posed certain challenges during its usage. Here are some of the issues encountered:

Expansion and contraction:

Due to the tight coupling of computing and storage resources in ClickHouse, scaling the system incurred higher costs and involved data migration. This prevented real-time on-demand scalability, resulting in inefficient resource utilization.

Multi-tenancy and shared cluster environment:

ClickHouse's tightly coupled architecture led to interactions among multiple tenants in a shared cluster environment. Since reading and writing operations were performed on the same node, they often interfered with each other, impacting overall performance.

Performance limitations:

ClickHouse's support for complex queries, such as multi-table join operations, was not optimal, which hindered the system's ability to handle such queries efficiently.

To address these pain points, ByteDance undertook an architectural upgrade of ClickHouse. In 2020, we initiated the ByConity project internally. After releasing the Beta version in January 2023, the project was officially made available to the public at the end of May 2023.

Figure 1: ByteDance ClickHouse Usage

Features of ByConity

ByConity implements a computing and storage separation architecture that transforms the original local management of computing and storage on individual nodes. Instead, it adopts a unified management approach for all data across the entire cluster using distributed storage. This transformation results in stateless computing nodes, enabling dynamic expansion and contraction by leveraging the scalability of distributed storage and the stateless nature of computing nodes. ByConity offers several crucial features that enhance its functionality and performance:

Storage-Computing Separation

One of the critical advantages of ByConity is its storage-computing separation architecture, which enables read-write separation and elastic scaling. This architecture ensures that read and write operations do not affect each other, and computing resources and storage resources can be independently expanded and contracted on demand, ensuring efficient resource utilization. ByConity also supports multi-tenant resource isolation, making it suitable for multi-tenant environments.

Figure 2: ByConity storage-computing separation to achieve multi-tenant isolation

Resource Isolation

ByConity provides resource isolation, ensuring that different tenants have separate and independent resources. This prevents interference or impact between tenants, promoting data privacy and efficient multi-tenancy support.

Elastic Scaling

ByConity supports elastic expansion and contraction, allowing for real-time and on-demand scaling of computing resources. This flexibility ensures efficient resource utilization and enables the system to adapt to changing workload requirements.

Strong Data Consistency

ByConity ensures strong consistency of data read and write operations. This ensures that data is always up-to-date and eliminates any inconsistencies between read and write operations, guaranteeing data integrity and accuracy.

High Performance

ByConity incorporates optimization techniques from mainstream OLAP engines, such as column storage, vectorized execution, MPP execution, and query optimization. These optimizations enhance read and write performance, enabling faster and more efficient data processing and analysis.

Technical Architecture of ByConity

ByConity follows a three-layer architecture consisting of:

Service access layer: The service access layer, represented by ByConity Server, handles client data and service access.
Computing layer: The computing layer comprises multiple computing groups, where each Virtual Warehouse (VW) functions as a computing group.
Data storage layer: The data storage layer utilizes distributed file systems like HDFS and S3.

Figure 3: ByConity's architecture

Working Principle of ByConity

ByConity is a powerful open-source cloud-native data warehouse that adopts a storage-computing separation architecture. In this section, we will examine the working principle of ByConity and the interaction of its components through the complete life cycle of a SQL query.

Figure 4: ByConity internal component interaction diagram

Figure 4 depicts the interaction diagram of ByConity's components. In the figure, the dotted line represents the inflow of a SQL query, the double-headed arrow indicates component interaction, and the one-way arrow represents data processing and output to the client. Let's explore the interaction process of each component in ByConity throughout the complete lifecycle of a SQL query.

ByConity's working principle can be divided into three stages:

Stage 1: Query Request

The client submits a Query request to the server. The server initially performs parsing and subsequently analyzes and optimizes the query through the Query Analyzer and Optimizer to generate an efficient executable plan. To access the required metadata, which is stored in a distributed key-value (KV) store, ByConity leverages FoundationDB and reads the metadata from the Catalog.

Stage 2: Plan Scheduling

ByConity passes the optimized executable plan to the Plan Scheduler component. The scheduler accesses the Resource Manager to obtain available computing resources and determines which nodes to schedule the query tasks for execution.

Stage 3: Query Execution

The Query request is executed on ByConity's Workers. The Workers read data from the underlying Cloud Storage and perform computations by establishing a Pipeline. The server then aggregates the calculation results from multiple Workers and returns them to the client.

Additionally, ByConity includes two other main components: the Time-stamp Oracle and the Daemon Manager. The Time-stamp Oracle supports transaction processing, while the Daemon Manager manages and schedules subsequent tasks.

Main Component Library

To better understand the working principle of ByConity, let's take a look at the main components of ByConity:

Metadata Management

ByConity offers a highly available and high-performance metadata read and write service called the Catalog Server. It supports complete transaction semantics (ACID). Furthermore, we have designed the Catalog Server with a flexible architecture, allowing for the pluggability of backend storage systems. Currently, we support Apple's open-source FoundationDB, and there is potential for extending support to other backend storage systems in the future.

Query Optimizer

The query optimizer plays a crucial role in the performance of a database system. A well-designed optimizer can significantly enhance query performance, particularly in complex query scenarios, where it can achieve performance improvements ranging from several times to hundreds of times. ByConity's self-developed optimizer focuses on improving optimization capabilities through two main approaches:

RBO (Rule-Based Optimization): This capability encompasses various optimizations such as column pruning, partition pruning, expression simplification, subquery decorrelation, predicate pushdown, redundant operator elimination, Outer-Join to Inner-Join conversion, operator pushdown to storage, distributed operator splitting, and other heuristic optimization techniques.
CBO (Cost-Based Optimization): ByConity's optimizer also includes cost-based optimization capabilities. This includes support for join reorder, outer-join reorder, join/agg reorder, common table expressions (CTE), materialized views, dynamic filter push-down, magic set optimization, and other cost-based techniques. Additionally, it integrates property enforcement for distributed planning.
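To make the rule-based side concrete, here is a minimal Python sketch of one rule from the list, predicate pushdown, applied to a toy plan tree. The node classes (`Scan`, `Filter`, `Join`) and the `push_down` function are hypothetical and are not ByConity's optimizer; the rule as written is only valid for inner joins.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical plan nodes for illustration only.

@dataclass(frozen=True)
class Scan:
    table: str
    columns: tuple

@dataclass(frozen=True)
class Filter:
    column: str      # column the predicate refers to
    condition: str   # opaque predicate text, e.g. "> 100"
    child: "Node"

@dataclass(frozen=True)
class Join:          # treated as an inner join
    left: "Node"
    right: "Node"

Node = Union[Scan, Filter, Join]

def output_columns(node) -> set:
    """Columns produced by a plan node."""
    if isinstance(node, Scan):
        return set(node.columns)
    if isinstance(node, Filter):
        return output_columns(node.child)
    return output_columns(node.left) | output_columns(node.right)

def push_down(node):
    """Rule: move a Filter below a Join when only one side has its column."""
    if isinstance(node, Filter):
        child = push_down(node.child)
        if isinstance(child, Join):
            in_left = node.column in output_columns(child.left)
            in_right = node.column in output_columns(child.right)
            if in_left and not in_right:
                pushed = Filter(node.column, node.condition, child.left)
                return Join(push_down(pushed), child.right)
            if in_right and not in_left:
                pushed = Filter(node.column, node.condition, child.right)
                return Join(child.left, push_down(pushed))
        return Filter(node.column, node.condition, child)
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    return node

orders = Scan("orders", ("order_id", "amount"))
users = Scan("users", ("user_id", "name"))
plan = Filter("amount", "> 100", Join(orders, users))
optimized = push_down(plan)  # the filter now sits directly above the orders scan
```

After the rewrite, rows from `orders` are filtered before the join instead of after it, so the join processes fewer rows. A rule like this needs no statistics, which is what distinguishes it from the cost-based techniques.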

Query Scheduling

ByConity currently supports two query scheduling strategies: Cache-aware scheduling and Resource-aware scheduling.

The cache-aware scheduling focuses on scenarios where computing and storage are separated. Its objective is to maximize cache utilization and minimize cold reads. This strategy aims to schedule tasks to nodes that have corresponding data caches, enabling computations to leverage the cache and improve read and write performance. Additionally, considering the dynamic expansion and contraction of the system, cache-aware scheduling strives to minimize the impact of cache failure on query performance when the computing group's topology changes.
Resource-aware scheduling analyzes the resource usage of different nodes within the computing group across the entire cluster. It performs targeted scheduling to optimize resource utilization. Moreover, resource-aware scheduling incorporates flow control mechanisms to ensure rational resource utilization and prevent negative effects caused by overload, such as system downtime.
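One common way to keep cache placement stable when workers join or leave is consistent hashing. The text does not say which technique ByConity uses, so the Python sketch below should be read as an assumption chosen for illustration; the class name `ConsistentHashRing`, the part names, and the number of virtual nodes are all hypothetical.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps data parts to workers so that topology changes move few parts."""

    def __init__(self, workers, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, worker) virtual nodes
        for worker in workers:
            self.add(worker)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add(self, worker: str) -> None:
        # Each worker owns several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{worker}#{i}"), worker))

    def remove(self, worker: str) -> None:
        self._ring = [(h, w) for h, w in self._ring if w != worker]

    def owner(self, part: str) -> str:
        """Return the worker whose cache should hold this part."""
        if not self._ring:
            raise RuntimeError("no workers in the ring")
        hashes = [h for h, _ in self._ring]
        # First virtual node clockwise from the part's hash, wrapping around.
        index = bisect.bisect_right(hashes, self._hash(part)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
parts = [f"part_{i}" for i in range(1000)]
before = {part: ring.owner(part) for part in parts}

ring.add("worker-4")  # the computing group scales out
after = {part: ring.owner(part) for part in parts}
moved = [part for part in parts if before[part] != after[part]]
```

With this scheme, every part that changes owner moves to the new worker; the assignments among the existing workers, and therefore their warm caches, are left untouched.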

Computing Group

ByConity enables different tenants to utilize distinct computing resources, as depicted in Figure 5. With ByConity's architecture, implementing features like multi-tenant isolation and read-write separation becomes straightforward. Each tenant can leverage separate computing groups to achieve multi-tenant isolation and support read-write separation. The computing groups can be dynamically expanded and contracted on-demand, ensuring efficient resource utilization. During periods of low resource utilization, resource sharing can be employed, allowing computing groups to be allocated to other tenants to maximize resource utilization and minimize costs.

Virtual File System

The virtual file system module serves as an intermediary layer for data reading and writing. ByConity has optimized this module to provide a "storage as a service" capability to other modules. The virtual file system offers a unified file system abstraction, shielding the underlying different back-end implementations. It facilitates easy expansion and supports multiple storage systems, such as HDFS or object storage.
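The idea of a unified file system abstraction can be sketched with a small Python interface and two interchangeable backends. The class and method names below are hypothetical, and the in-memory class merely stands in for an object store; none of this reflects ByConity's actual interfaces.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile

class VirtualFileSystem(ABC):
    """The only storage interface the rest of the system sees (hypothetical)."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...

class LocalDiskBackend(VirtualFileSystem):
    def __init__(self, root: str):
        self._root = Path(root)

    def _resolve(self, path: str) -> Path:
        return self._root / path.lstrip("/")

    def write(self, path: str, data: bytes) -> None:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def read(self, path: str) -> bytes:
        return self._resolve(path).read_bytes()

    def exists(self, path: str) -> bool:
        return self._resolve(path).exists()

class InMemoryObjectStore(VirtualFileSystem):
    """Stand-in for an object store: a flat key -> bytes mapping."""

    def __init__(self):
        self._objects = {}

    def write(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def read(self, path: str) -> bytes:
        return self._objects[path]

    def exists(self, path: str) -> bool:
        return path in self._objects

def copy_file(src: VirtualFileSystem, dst: VirtualFileSystem, path: str) -> None:
    """Caller code depends only on the abstraction, never on a backend."""
    dst.write(path, src.read(path))

with tempfile.TemporaryDirectory() as root:
    local = LocalDiskBackend(root)
    remote = InMemoryObjectStore()
    local.write("db/table/part_0/data.bin", b"column data")
    copy_file(local, remote, "db/table/part_0/data.bin")
    print(remote.read("db/table/part_0/data.bin"))  # b'column data'
```

Supporting another storage system then means adding one more subclass, while callers such as `copy_file` stay unchanged.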

Cache Acceleration

ByConity utilizes caching to accelerate query processing. Under the computing-storage separation architecture, cache acceleration is performed in both the metadata and data dimensions. In the metadata dimension, ByConity caches table and partition information in the memory of the ByConity Server. In the data dimension, cache acceleration occurs on the Worker side within the computing group. This Worker-side cache is hierarchical, using both memory and disk, with the mark set as the cache granularity. Together, these caching strategies improve query speed.
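A hierarchical cache of this kind can be sketched as two LRU tiers in front of remote storage. The Python below is a simplified illustration under stated assumptions: the "disk" tier is simulated with an in-process dictionary, the keys are arbitrary strings standing in for mark ranges, and the class name `TieredCache` and its eviction policy are hypothetical rather than ByConity's implementation.

```python
from collections import OrderedDict

class TieredCache:
    """Small memory tier backed by a larger (simulated) disk tier."""

    def __init__(self, memory_slots: int, disk_slots: int, load_from_remote):
        self._memory = OrderedDict()  # most recently used entries last
        self._disk = OrderedDict()    # stand-in for a local disk cache
        self._memory_slots = memory_slots
        self._disk_slots = disk_slots
        self._load = load_from_remote
        self.remote_reads = 0         # counts cold reads from remote storage

    def get(self, key: str):
        if key in self._memory:
            self._memory.move_to_end(key)   # refresh LRU position
            return self._memory[key]
        if key in self._disk:
            value = self._disk.pop(key)     # promote from disk to memory
        else:
            value = self._load(key)         # cold read from remote storage
            self.remote_reads += 1
        self._put_memory(key, value)
        return value

    def _put_memory(self, key: str, value) -> None:
        self._memory[key] = value
        if len(self._memory) > self._memory_slots:
            old_key, old_value = self._memory.popitem(last=False)
            self._disk[old_key] = old_value      # demote instead of dropping
            if len(self._disk) > self._disk_slots:
                self._disk.popitem(last=False)   # evict the oldest disk entry

cache = TieredCache(memory_slots=1, disk_slots=1,
                    load_from_remote=lambda key: f"data:{key}")
cache.get("marks_0_8")    # cold read
cache.get("marks_8_16")   # cold read; the first entry is demoted to disk
cache.get("marks_0_8")    # served from the disk tier, no remote read
print(cache.remote_reads)  # 2
```

The third lookup shows the benefit of the second tier: an entry that no longer fits in memory can still be served locally instead of triggering another read from remote storage.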

How to Deploy ByConity

ByConity currently supports four acquisition and deployment modes. Community developers are welcome to use them and submit issues to us:

Stand-alone Docker: ByConity provides a Docker deployment option, which can be accessed at https://github.com/ByConity/byconity-docker
K8s cluster deployment: ByConity also supports deployment on Kubernetes clusters. The deployment guide for Kubernetes can be found at https://github.com/ByConity/byconity-deploy
Physical machine deployment: If you prefer to deploy ByConity on physical machines, you can refer to the repository at https://github.com/ByConity/ByConity/tree/master/packages
Source code compilation: You can compile the ByConity source code yourself. The source code repository can be accessed at https://github.com/ByConity/ByConity#build-byconity

ByConity's Future Open-Source Plan

ByConity includes several key milestones in its open-source community roadmap through 2023. These milestones are designed to enhance ByConity's functionality, performance, and ease of use. Among them, the development of new storage engines, support for more data types, and integration with other data management tools are some important areas of focus. We have listed the following directions and created an issue on GitHub, https://github.com/ByConity/ByConity/issues/26, inviting the community to join us to discuss co-development:

Performance improvement: ByConity aims to boost performance through various optimizations. This includes leveraging indexes for acceleration, such as Skip-index optimization, support for new Zorder-index and inverted indexes. ByConity will also focus on the construction and acceleration of external indexes, as well as the automatic recommendation and conversion of indexes. Continuous enhancements to the query optimizer and the implementation of a distributed caching mechanism are also part of the performance improvement efforts.
Stability improvement: There are two aspects here.
- One is to support resource isolation in more dimensions. ByConity is committed to improving stability by extending resource isolation capabilities in multiple dimensions, thereby providing better multi-tenant support.
- The second direction is to enrich metrics and improve observability and problem diagnosis capabilities, ensuring a stable and reliable experience for users.
Enterprise-level feature enhancements: ByConity aims to introduce finer-grained permission control, improve data security-related functions such as backup, recovery, and data encryption, and continue to explore techniques for deep compression of data to save storage costs.
Ecosystem compatibility improvement: ByConity plans to expand its compatibility with various storage systems, including popular object storage solutions like S3 and TOS. It plans to enhance the overall compatibility and integration capabilities, facilitating seamless integration with other tools and frameworks. Moreover, it aims to support data lake federation queries, enabling interaction with technologies like Hudi, Iceberg, and more.

Working with the Community

Since the release of the Beta version, ByConity has received support from numerous enterprise developers, including Huawei, Electronic Cloud, Zhanxinzhanli, Tianyi Cloud, Vipshop, and Transsion Holdings. These organizations have actively contributed by deploying ByConity in their own environments, running TPC-DS verification, and conducting tests in their business scenarios. The results have been promising, and their feedback has provided valuable insights for improvement, which we greatly appreciate.

We are delighted by the ideas that community partners have shared and by their willingness to build ByConity together. We have already initiated joint development efforts with Huawei Terminal Cloud. Our collaboration will focus on areas such as Kerberos authentication, ORC support, and integration with S3 storage.

If you are interested in joining our community and participating in the development of ByConity, we invite you to visit our GitHub repository at https://github.com/ByConity/ByConity. There you can find more information about our ongoing projects, contribute your ideas, and collaborate with us to further enhance ByConity. To get involved, join our Discord or follow us on Twitter.

Summary

In summary, ByConity is an open source cloud-native data warehouse that offers features such as read-write separation, elastic expansion and contraction, tenant resource isolation, and strong data read and write consistency. It utilizes a computing-storage separation architecture and leverages optimizations from mainstream OLAP engines to deliver excellent read and write performance. As ByConity continues to evolve and improve, it aims to become a key tool for cloud-native data warehousing in the future.

ByConity -- An Open source Cloud-native Data Warehouse

April 10, 2023 · 10 min read

Zhaojie Niu

ByConity maintainer

Yunbo Wang

ByConity maintainer

Introduction to ByConity

ByConity is an open-source cloud-native data warehouse that adopts a storage-computing separation architecture. It supports several key features, including separation of storage and computing, elastic expansion and contraction, isolation of tenant resources, and strong consistency of data read and write. By utilizing mainstream OLAP engine optimizations, such as column storage, vectorized execution, MPP execution, query optimization, etc., ByConity can provide excellent read and write performance.

ByConity's History

ByConity's history can be traced back to 2018, when ByteDance began to use ClickHouse internally. As the business grew, the data scale grew with it to serve a large number of users. However, ClickHouse uses a Shared-Nothing architecture: each node is independent and does not share storage, so computing and storage resources are tightly coupled. This makes expansion and contraction costly, because scaling involves data migration, which prevents real-time, on-demand scaling and wastes resources. Furthermore, the tightly coupled architecture causes tenants in a shared cluster to interfere with each other. In addition, because reads and writes are completed on the same node, they affect each other. Finally, ClickHouse performs poorly on complex queries such as multi-table joins. Based on these pain points, the ByConity project was launched in January 2020.

The ByConity team hopes to give the project back to the community and improve it through the power of open source. In January 2023, ByConity was officially open-sourced, and the beta version was released.

Figure 1: ByteDance ClickHouse Usage

Features of ByConity

ByConity has several key features that make it a powerful open-source cloud-native data warehouse.

Storage-Computing Separation

Figure 2: ByConity storage-computing separation to achieve multi-tenant isolation

Elastic Scaling

ByConity supports elastic scaling: computing resources can be expanded and contracted in real time and on demand, ensuring efficient use of resources.

Resource Isolation

ByConity isolates the resources of different tenants, ensuring that tenants are not affected by each other.

Strong Data Consistency

ByConity ensures strong consistency of data read and write, ensuring that data is always up to date with no inconsistencies between reads and writes.

High Performance

ByConity adopts mainstream OLAP engine optimizations, such as column storage, vectorized execution, MPP execution, query optimization, etc., ensuring excellent read and write performance.

ByConity's Technical Architecture

ByConity's architecture is divided into three layers:

Service access layer: Responsible for client data and service access, i.e., ByConity Server
Computing group: ByConity's computing resource layer, where each Virtual Warehouse is a computing group
Data storage: Distributed file system, such as HDFS, S3, etc.

Figure 3: ByConity's architecture

Working Principle of ByConity

Figure 4: ByConity internal component interaction diagram

Figure 4 above shows how ByConity's components interact. The dotted line indicates the inflow of a SQL query, the solid two-way arrows indicate interaction between components, and the one-way arrows indicate data processing and output to the client.

ByConity's working principle can be divided into three stages:

Stage 1: Query Request

The client submits a query request to the server. The server first parses the query, then analyzes and optimizes it through the analyzer and optimizer to generate a more efficient executable plan. This stage reads metadata, which is stored in a distributed KV store. ByConity uses FoundationDB for this store and reads the metadata through the Catalog.

Stage 2: Plan Scheduling

ByConity submits the executable plan generated by the analyzer and optimizer to the scheduler (Plan Scheduler). The scheduler obtains idle computing resources by accessing the resource manager and decides which nodes the query tasks are scheduled on.

Stage 3: Query Execution

Query requests are finally executed on ByConity's Workers. Each Worker reads data from the underlying cloud storage and performs calculations by establishing a pipeline. Finally, the server aggregates the calculation results of the Workers and returns them to the client.

In addition to the above components, ByConity also has two main components, the Time-stamp Oracle and the Daemon Manager. The former supports transaction processing, and the latter manages and schedules some subsequent tasks.

Main Component Library

To better understand the working principle of ByConity, let's take a look at the main components of ByConity:

Metadata Management

ByConity provides a highly available and high-performance metadata read and write service, the Catalog Server, which supports complete transaction semantics (ACID). We have also abstracted the Catalog Server so that the back-end storage system is pluggable. Currently, we support Apple's open-source FoundationDB, and support for more back-end storage systems can be added later.

Query Optimizer

The query optimizer is one of the cores of the database system. A good optimizer can greatly improve query performance. ByConity's self-developed optimizer improves optimization capabilities based on two directions:

RBO: Rule-Based Optimization capability. Supports column pruning, partition pruning, expression simplification, subquery decorrelation, predicate pushdown, redundant operator elimination, Outer-Join to Inner-Join conversion, operator pushdown to storage, distributed operator splitting, etc.
CBO: Cost-Based Optimization capability. Supports Join Reorder, Outer-Join Reorder, Join/Agg Reorder, CTE, Materialized View, Dynamic Filter Push-Down, Magic Set, and other cost-based optimizations, and integrates Property Enforcement for distributed planning.
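Join Reorder is a good example of why cost matters. The Python sketch below enumerates left-deep join orders and picks the cheapest one under a deliberately simple cost model (the sum of estimated intermediate row counts). The cost model, the exhaustive enumeration, the table names, and the statistics are all assumptions for illustration; a real optimizer such as ByConity's uses far richer statistics and search strategies.

```python
from itertools import permutations

def plan_cost(order, rows, selectivity) -> float:
    """Cost of a left-deep plan: total rows produced by its intermediate joins."""
    joined = [order[0]]
    current_rows = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivity of every join predicate that links the new
        # table to a table already joined; no predicate means a cross product.
        combined = 1.0
        for other in joined:
            combined *= selectivity.get(frozenset((other, table)), 1.0)
        current_rows = current_rows * rows[table] * combined
        cost += current_rows
        joined.append(table)
    return cost

def best_join_order(rows, selectivity):
    """Exhaustively try every left-deep order; fine only for a few tables."""
    return min(permutations(rows),
               key=lambda order: plan_cost(order, rows, selectivity))

# Hypothetical statistics: estimated row counts and join selectivities.
rows = {"sales": 1_000_000, "customers": 10_000, "regions": 10}
selectivity = {
    frozenset(("sales", "customers")): 1 / 10_000,
    frozenset(("customers", "regions")): 1 / 10,
}

written_order = ("sales", "customers", "regions")
chosen_order = best_join_order(rows, selectivity)
print(chosen_order, plan_cost(chosen_order, rows, selectivity))
print(written_order, plan_cost(written_order, rows, selectivity))
```

Under these made-up numbers, joining the two small tables first and the large `sales` table last roughly halves the estimated cost compared with the order in which the tables were written, and avoids the far more expensive orders that start with a cross product.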

Query Scheduling

ByConity currently supports two query scheduling strategies: Cache-aware scheduling and Resource-aware scheduling.

The cache-aware scheduling strategy targets scenarios where storage and computing are separated, and aims to maximize cache usage and avoid cold reads. It tries to schedule tasks to nodes that already hold the corresponding data in their caches, so that calculations can hit the cache and read and write performance improves.
Resource-aware scheduling perceives the resource usage of different nodes in the computing group in the entire cluster and performs targeted scheduling to maximize resource utilization. At the same time, it also performs flow control to ensure reasonable use of resources and avoid negative effects caused by overload, such as system downtime.
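Resource-aware scheduling with flow control can be illustrated with a short Python sketch: pick the least-loaded worker, and refuse to schedule when every worker is above a utilization limit. The `WorkerLoad` class, the task-slot notion of utilization, and the 0.8 limit are hypothetical choices for the example, not ByConity's actual metrics or thresholds.

```python
from dataclasses import dataclass

@dataclass
class WorkerLoad:
    name: str
    running_tasks: int
    max_tasks: int

    @property
    def utilization(self) -> float:
        return self.running_tasks / self.max_tasks

class Overloaded(Exception):
    """Raised when flow control decides that the query has to wait."""

def pick_worker(workers, max_utilization: float = 0.8) -> str:
    """Choose the least-utilized worker that is still below the limit."""
    candidates = [w for w in workers if w.utilization < max_utilization]
    if not candidates:
        # Flow control: refuse to push more work onto an overloaded group.
        raise Overloaded("all workers are above the utilization limit")
    chosen = min(candidates, key=lambda w: w.utilization)
    chosen.running_tasks += 1  # account for the newly scheduled task
    return chosen.name

group = [
    WorkerLoad("worker-a", running_tasks=7, max_tasks=10),
    WorkerLoad("worker-b", running_tasks=2, max_tasks=10),
    WorkerLoad("worker-c", running_tasks=9, max_tasks=10),
]
print(pick_worker(group))  # worker-b
```

Rejecting or queueing work when no candidate remains is what keeps an overloaded computing group from being pushed into failure, at the price of higher latency for the waiting query.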

Computing Group

ByConity lets different tenants use different computing resources. Under ByConity's new architecture, it is easy to implement features such as multi-tenant isolation and read-write separation: different tenants can use different computing groups. Because scaling is convenient, a computing group can be dynamically expanded and contracted on demand to ensure efficient resource utilization. When resource utilization is low, resources can be shared, and computing groups can be lent to other tenants to maximize resource utilization and reduce costs.

Virtual File System

The virtual file system module serves as the middle layer for data reading and writing. ByConity has encapsulated it so that storage is exposed to other modules as a service, realizing "storage as a service". The virtual file system provides a unified file system abstraction, shields the different back-end implementations, facilitates expansion, and supports multiple storage systems, such as HDFS or object storage.

Cache Acceleration

ByConity accelerates queries through caching. Under the storage-computing separation architecture, ByConity caches in both the metadata and data dimensions. In the metadata dimension, the cache lives in the memory of the ByConity Server, with table and partition as the granularity. In the data dimension, the cache lives on the Worker side, that is, in the computing group. The Worker-side cache is hierarchical, using both memory and disk, with the mark set as the cache granularity, which effectively improves query speed.

How to Obtain and Deploy

ByConity currently supports four acquisition and deployment modes. Community developers are welcome to use them and submit issues to us:

Stand-alone version
- Start with Docker Compose. Reference: https://github.com/ByConity/byconity-docker
K8s cluster version
- Deploy on Kubernetes. Reference: https://github.com/ByConity/byconity-deploy
Physical machine deployment
- Deploy on a physical machine using the package manager. Reference: https://github.com/ByConity/ByConity/tree/master/packages
Source code compilation
- Reference: https://github.com/ByConity/ByConity#build-byconity

ByConity's Future Open-Source Plan

ByConity includes several key milestones in its open-source community roadmap through 2023. These milestones are designed to enhance ByConity's functionality, performance, and ease of use. Among them, the development of new storage engines, support for more data types, and integration with other data management tools are some important areas of focus. We have listed the following directions and created an issue on GitHub, https://github.com/ByConity/ByConity/issues/26, inviting community partners to join us to discuss co-development:

Performance improvement: ByConity hopes to keep improving performance along three technical directions.
- The first is to use indexes for acceleration, which includes four aspects: optimizing the existing skip index; exploring new index types, such as zorder-index and inverted index; building and accelerating indexes for Hive tables; and index recommendation and conversion, lowering the threshold for users.
- The second is the continuous optimization of the query optimizer.
- The third is a distributed cache. ByConity's cache mechanism is currently local, and each computing group can only access its own cache. In the future, we hope to implement a distributed cache mechanism to further improve the cache hit rate.
Stability improvement: There are two aspects here. One is to support resource isolation in more dimensions. Currently, resource isolation is only supported in the computing group dimension. In the next step, resource isolation will also be supported on the server side, providing better end-to-end guaranteed multi-tenancy. The second is to enrich metrics and improve observability and problem diagnosis capabilities.
Enterprise-level feature enhancements: We hope to achieve finer-grained permission control, including column-level permission control. We also plan to improve the functions related to data security, such as data backup and recovery and end-to-end data encryption. Finally, we continue to explore the deep compression of data to save storage costs.
Ecosystem compatibility improvement: This direction is the most important one. ByConity plans to support more types of storage backends, such as AWS's S3 and Volcano Engine's object storage. Improving ecosystem compatibility also includes integration with some drivers and some open source software. At the same time, we hope to support federated queries over data lakes, such as Hudi and Iceberg.

In short, ByConity is an open source cloud-native data warehouse that provides read-write separation, elastic expansion and contraction, tenant resource isolation, and strong consistency of data read and write. Its storage-computing separation architecture, combined with mainstream OLAP engine optimization, ensures excellent read and write performance. As ByConity continues to develop and improve, it is expected to become an important tool for cloud-native data warehouses in the future.

We have a video that introduces ByConity in detail, including a demo of ByConity. If you need more information, you can check the following link: https://www.bilibili.com/video/BV15k4y1b7pw/?spm_id_from=333.999.0.0&vd_source=71f3be2102fec1a0171b49a530cefad0

