
· 15 min read
Tao Wang
Yunbo Wang

Background

Any discussion of data warehouses inevitably involves Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT): extracting data of different origins and formats into the warehouse and processing it there. Traditionally, data transformation uses ETL to shape business data into a warehouse-friendly model, but this depends on an ETL system that lives outside the warehouse and is therefore costly to maintain. As a cloud-native data warehouse, ByConity has been adding ELT support step by step since version 0.2.0, freeing users from maintaining multiple heterogeneous data systems. This post covers ByConity's ELT roadmap, implementation principles, and usage.

ETL Scenarios and Solutions

Differences Between ELT and ETL

  • ETL: describes the process of extracting data from the source, transforming it, and loading it into the destination (the data warehouse). The Transform step usually refers to data processing performed before the data enters the warehouse.

  • ELT: focuses on loading minimally processed data into the warehouse and leaves most of the transformation to the analysis phase. Compared with ETL, it requires much less upfront data modeling and gives analysts more flexible options. ELT has become the norm in big-data processing, and it places many new demands on the data warehouse.
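To make the contrast concrete, here is a minimal ELT sketch in SQL. The schema and table names are hypothetical, and the table engine assumes ByConity's CnchMergeTree; the point is that raw data lands in the warehouse nearly untouched, and the transformation runs inside the warehouse at analysis time:

```sql
-- Load: land the raw events with minimal processing
CREATE TABLE raw_events
(
    event_time DateTime,
    user_id    UInt64,
    event_type String,
    payload    String
)
ENGINE = CnchMergeTree()
ORDER BY event_time;

-- Transform: performed inside the warehouse, when analysis needs it
INSERT INTO daily_active_users
SELECT
    toDate(event_time)      AS day,
    count(DISTINCT user_id) AS dau
FROM raw_events
GROUP BY day;
```

In an ETL pipeline, the equivalent of the second statement would run in an external system before the load.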

The Challenge of Duplicated Resources

A typical data pipeline looks like this: behavioral data, logs, clickstreams, and so on are ingested through MQ/Kafka/Flink into a storage system, either on-premises HDFS or remote storage such as OSS and S3 in the cloud, then put through a series of warehouse ETL steps and served to an OLAP system for analytical queries. Some businesses, however, need to branch off this pipeline: at some stage of analysis they export data out of the main flow and run ETL logic that differs from the main path. The result is two copies of the data in storage, and along the way two divergent sets of ETL logic. As data volume grows, the cost pressure from this redundant computation and storage keeps mounting, and the bloated storage footprint also makes elastic scaling inconvenient.

How the Industry Responds

In the industry, several schools of thought have emerged to address these problems:

  • Pre-computation (e.g. Kylin): if reports on a Hadoop system come out slowly or aggregation is weak, pre-compute the configured metrics into cubes or views ahead of time. When the SQL query actually runs, it is answered directly from the pre-built cube or view and returned immediately.
  • Unified stream-batch processing (e.g. Flink, RisingWave): as data streams in, aggregate in memory the data needed for reports or dashboards, write the aggregated results into HBase or MySQL, and read them back for display. Flink also exposes its intermediate state directly through queryable state, letting users work with state data more conveniently. In the end, though, the streaming results still have to be reconciled against the batch results, and any mismatch triggers backfill queries; keeping this whole process healthy tests the skill of the ops and development teams.
  • Lakehouse & HxxP: combine the data lake with the data warehouse.

ELT in ByConity

Overall Execution Flow

What ELT tasks demand of the system:

  1. Easy overall scaling: import and transformation usually need large amounts of resources, so the system must scale horizontally to keep up with rapid data growth.
  2. Reliability and fault tolerance: large numbers of jobs can be scheduled in an orderly way; when a task fails sporadically (OOM) or a container fails, it can be restarted and retried; a certain amount of data skew can be handled.
  3. Efficiency & performance: make full use of multi-core and multi-machine parallelism; import data quickly; use memory effectively (memory management); optimize CPU usage (vectorization, codegen).
  4. Ecosystem & observability: integrate with a variety of tools; expose task status and task progress; allow failure-log queries; provide a degree of visualization.

To meet these ELT requirements and address the difficulties of current scenarios, ByConity added the following features and optimizations.

Staged Execution (Stage-level Scheduling)

How It Works

  • ClickHouse currently executes SQL as follows:
    • Phase one: after receiving a query on a distributed table, the Coordinator rewrites it into local-table queries and sends one to each shard node;
    • Phase two: the Coordinator collects the partial results from all nodes, merges them, and returns the final result to the client.
  • ClickHouse turns the right-hand table of a Join into a subquery, which raises several problems that are hard to solve:
    • a complex query contains multiple subqueries, so the rewrite is highly complex;
    • when the Join table is large, worker nodes easily run out of memory (OOM);
    • aggregation happens on the Coordinator, which is heavily loaded and easily becomes the performance bottleneck.

Unlike ClickHouse, ByConity optimizes the execution of complex queries. By splitting the execution plan, it replaces the old two-phase model with staged execution. During logical planning, exchange operators are inserted according to operator types; at execution time, the plan is cut into a DAG along these exchange operators and scheduled stage by stage, with the exchange operators between stages handling data transfer. The key components are listed below; a schematic example of stage splitting follows the list:

  1. exchange operator insertion
  2. stage splitting
  3. stage scheduler
  4. segment executor
  5. exchange manager
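As a schematic illustration (the query is hypothetical, and the stage boundaries below are descriptive comments, not actual ByConity EXPLAIN output), here is how a join-plus-aggregation query might be cut into stages at the exchange operators:

```sql
SELECT o.user_id, sum(o.amount) AS total
FROM orders AS o
JOIN users AS u ON o.user_id = u.id
GROUP BY o.user_id;

-- A possible stage DAG, cut at the exchange operators:
--   Stage 0: scan users,  repartition (shuffle) by id
--   Stage 1: scan orders, repartition (shuffle) by user_id
--   Stage 2: join the two shuffled streams, pre-aggregate on the workers
--   Stage 3: final aggregation; the coordinator only collects the result
```

Each stage runs independently once its inputs are ready, which is what makes stage-level scheduling and retry possible.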

Let's look more closely at the exchange implementation. As the figure above shows, the top level is the query plan; when it is converted into a physical plan, it becomes different operators according to the required data distribution. The source side, which receives data, is largely uniform and is called ExchangeSource. The sink side has several implementations, BroadcastSink, Local, PartitionSink, and so on, which run as part of the map task. For cross-node data movement we use a unified brpc streaming transport underneath; for local movement we use an in-memory queue. We optimized each case in great detail:

  • Transport layer
    • within a process: an in-memory queue with no serialization and zero copy
    • across processes: brpc stream RPC, with ordering guarantees, connection reuse, status-code propagation, compression, and more
  • Operator layer
    • batched sends
    • thread reuse to reduce the number of threads

The Benefits

Because ByConity adopts multi-stage query execution throughout, the overall gains are large:

  • A more stable and efficient Coordinator
    • aggregation and other operators are split out to the worker nodes
    • the Coordinator only has to merge the final results
  • Fewer worker OOMs
    • stage splitting keeps each stage's computation relatively simple
    • the added exchange operators relieve memory pressure
  • More stable and efficient network connections
    • exchange operators transfer data efficiently
    • connection pools are reused

Adaptive Scheduler

The Adaptive Scheduler is a feature we built for stability. In OLAP scenarios you may occasionally see incomplete data or query timeouts. The root cause is that every worker is shared by all queries, so a single slow worker drags down the execution of the entire query.

Problems with shared compute nodes:

  • the load on a Scan node depends on how much data each query scans, so it can never be perfectly even;
  • the resources needed by different Plan Segments vary widely. Together these cause severe load imbalance across worker nodes, and a heavily loaded worker slows down the progress of the whole query. We therefore made the following optimizations:
  • a Worker health mechanism: the server maintains a worker health manager that can quickly obtain the health of a Worker Group, including CPU, memory, number of running queries, and other information;
  • adaptive scheduling: each SQL statement dynamically selects workers, and controls per-node concurrency, based on worker health.

Query Queue

Clusters do become saturated: when all workers are unhealthy, full, or overloaded, the query queue takes over.

We implemented a manager directly on the server side. For each query, the manager checks the cluster's resources while holding a lock; if resources are insufficient, the query waits and is woken up once resources are released. This prevents the server from dispatching work without limit, overloading the worker nodes, and crashing them.

The current implementation is fairly simple. Servers run as multiple instances, and each instance has its own queue, so it only holds a local view with no global picture of cluster resources. Moreover, the query state in each queue is merely cached in memory, not persisted. Later, we will add coordination between servers to limit query concurrency from a global viewpoint, persist the queries held by each server instance, and add support for failover scenarios.

Async Execution

A defining trait of ELT tasks is that they run much longer than interactive analysis: typically minutes, sometimes hours. Today, ClickHouse clients block until the query returns, which leaves the client waiting, and holding its connection to the server, the whole time. On an unstable network, the connection between client and server drops and the server-side task fails with it. To eliminate these needless failures, and to spare clients the added complexity of keeping connections alive, we built asynchronous execution. It works as follows:

  1. The user requests asynchronous execution, either per query via settings enable_async_query = 1 or per session via set enable_async_query = 1.
  2. An asynchronous query is handed to a background thread pool to run.
  3. I/O is silenced: while an asynchronous query runs, its interaction with the client, such as log output, is cut off.

Query initialization still happens on the session's synchronous thread. Once initialization completes, the query state is written to the metastore and an async query id is returned to the client, which can use this id to poll the query's status. Returning the async query id ends the interactive part of the query, so if the statement is a SELECT, its result can no longer be streamed back. In that case, we recommend combining an async query with SELECT ... INTO OUTFILE.
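A minimal usage sketch (the table and file names are illustrative):

```sql
-- Session level: all subsequent queries in this session run asynchronously.
-- (A single statement can opt in instead via SETTINGS enable_async_query = 1.)
SET enable_async_query = 1;

-- The server returns an async query id immediately instead of blocking;
-- poll that id to track the query's status. Since a SELECT result can no
-- longer be streamed back, write it to a file instead:
SELECT user_id, count(*) AS pv
FROM events
GROUP BY user_id
INTO OUTFILE 'daily_pv.csv';
```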

Future Plans

For mixed ELT workloads, ByConity 0.2.0 is only a first taste. Future releases will keep improving query capability; the ELT-centered roadmap is:

Failure Recovery

  • Operator spill
    • spill for the Sort, Agg, and Join operators;
    • exchange spill;
  • Recoverability
    • operator-level recovery: when an ELT task runs for a long time, one sporadic mid-flight task failure currently fails the whole query; task-level retry would greatly reduce sporadic failures caused by the environment;
    • stage retry: retry at the stage level when a node fails;
    • the ability to persist the state of queued jobs;
  • Remote shuffle service: the shuffle services open-sourced so far are built for Spark and lack a general-purpose client, such as a C++ client. We will fill in this capability.

Resources

  • Specifiable compute resources: users can specify the resources a query needs;
  • Resource estimation and reservation: dynamically estimate the resources a query needs and allocate them by reservation;
  • Dynamic resource acquisition: workers are currently resident processes/nodes; acquiring resources dynamically would raise utilization;
  • Finer-grained resource isolation: isolate at the worker-group or process level so that queries interfere with each other less.

· 13 min read
Yunbo Wang

Introduction

As data volume and data complexity keep growing, more and more enterprises use OLAP (online analytical processing) engines to process large-scale data and deliver instant analytical results. Performance is a key factor in choosing an OLAP engine, so this post uses the 99 queries of the TPC-DS benchmark to compare the performance of four open-source OLAP engines, ClickHouse, Doris, Presto, and ByConity, as a reference for enterprises choosing an engine.

About the TPC-DS Benchmark

TPC-DS is the decision-support benchmark of the Transaction Processing Performance Council (TPC). It targets decision-support systems (DSS), models multidimensional analysis and decision-support scenarios, and provides 99 query statements for evaluating a database system's performance in complex multidimensional analysis. Each query is designed to model a complex decision-support scenario and exercises advanced SQL features such as joins across multiple tables, aggregation and grouping, and subqueries.
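For a flavor of the workload, here is TPC-DS query 3, lightly abridged and with one common parameter substitution: a fact table joined with two dimension tables, then aggregated and sorted:

```sql
SELECT dt.d_year,
       item.i_brand_id AS brand_id,
       item.i_brand AS brand,
       sum(ss_ext_sales_price) AS sum_agg
FROM date_dim AS dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 128   -- substitution parameter
  AND dt.d_moy = 11
GROUP BY dt.d_year, item.i_brand_id, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;
```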

The OLAP Engines

ClickHouse, Doris, Presto, and ByConity are all popular open-source OLAP engines, and all of them offer high performance and scalability.

  • ClickHouse is a columnar database management system developed by Yandex, the Russian search-engine company, focused on fast querying and analysis of large-scale data.
  • Doris is a distributed columnar storage and analysis system that supports real-time queries and analysis and integrates with big-data technologies such as Hadoop, Spark, and Flink.
  • Presto is a distributed SQL query engine developed by Facebook for fast querying and analysis over large datasets.
  • ByConity is a cloud-native data warehouse open-sourced by ByteDance. It adopts a storage-compute separation architecture, provides tenant resource isolation, elastic scaling, and strong read-write consistency, and supports the optimization techniques of mainstream OLAP engines, giving it excellent read and write performance. This post runs the 99 TPC-DS queries on these four engines and compares their performance across different query types.

Test Environment and Method

Test environment configuration:

  • Environment: Memory: 256 GB; Disk: ATA, 7200 rpm, GPT-partitioned; System: Linux 4.14.81.bm.30-amd64 x86_64, Debian GNU/Linux 9

  • Test data volume: a 1 TB dataset, roughly 2.8 billion rows

| Software   | Version     | Release Date | Nodes                    | Other Configuration |
|------------|-------------|--------------|--------------------------|---------------------|
| ClickHouse | 23.4.1.1943 | 2023-04-26   | 5 workers                | distributed_product_mode = 'global', partial_merge_join_optimizations = 1 |
| Doris      | 1.2.4.1     | 2023-04-27   | 5 workers, 1 coordinator | Buckets: dimension tables 1, returns tables 10-20, sales tables 100-200 |
| Presto     | 0.28.0      | 2023-03-16   | 5 workers, 1 server      | Hive Catalog, ORC format, Xmx200GB, enable_optimizer=1, dialect_type='ANSI' |
| ByConity   | 0.1.0-GA    | 2023-03-15   | 5 workers                | enable_optimizer=1, dialect_type='ANSI' |

Server configuration:

Architecture:          x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2494.435
CPU max MHz: 2900.0000
CPU min MHz: 1200.0000
BogoMIPS: 4389.83
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

Test method:

  • Run the 99 TPC-DS queries against 1 TB (2.8 billion rows) of data on each of the four OLAP engines.
  • Use the same test dataset, configuration, and hardware environment for every engine.
  • Execute each query several times and take the average to reduce measurement error, with a per-query timeout of 500 seconds.
  • Record execution details such as the query plan and I/O and CPU usage.

Performance Results

We tested the four OLAP engines on the same dataset and hardware environment: a 1 TB dataset, with the hardware and software described above. We ran the 99 TPC-DS queries three consecutive times on each engine and took the average of the three runs. ByConity completed all 99 queries. Doris crashed on SQL15 and timed out four times, on SQL54, SQL67, SQL78, and SQL95. Presto timed out only on SQL67 and SQL72 and completed everything else. ClickHouse completed only about 50% of the queries; some failed with timeouts and the rest with system errors. Our analysis is that ClickHouse cannot effectively support multi-table join queries, and such SQL only runs after manual rewriting and splitting.

We therefore exclude ClickHouse from the total-time comparison. Figure 1 below shows the total TPC-DS time of the other three engines: open-source ByConity is clearly faster than the others, by roughly 3-4x. (Note: the vertical axis of every chart below is in seconds.)

Figure 1: Total time for the 99 TPC-DS queries

Next, we group the 99 TPC-DS queries by scenario, for example basic queries, join queries, aggregation queries, subqueries, and window-function queries, and use these categories to compare the performance of ClickHouse, Doris, Presto, and ByConity:

Basic Queries

This scenario covers simple operations such as querying a single table, filtering, and sorting, and mainly measures the ability to process an individual query. ByConity performed best, with Presto and Doris also doing well: basic queries usually touch only a few tables and columns, which plays to Presto's and Doris's distributed execution and in-memory computation. ClickHouse handles multi-table access poorly and failed to complete several runs; SQL5, 8, 11, 13, 14, 17, and 18 all timed out. We count a timeout as 500 seconds but clip the chart at 350 seconds for clarity. Figure 2 below shows the average query time of the four engines on basic queries:

Figure 2: Performance comparison on TPC-DS basic queries

Join Queries

Join queries are the common multi-table scenario: JOIN statements connect several tables and retrieve data on given conditions. As Figure 3 shows, ByConity performs best, thanks mainly to its query optimizer, which introduces cost-based optimization (CBO) and re-orders tables during multi-table joins. Presto and Doris come next. ClickHouse's multi-table join performance lags the other three, and its support for many complex statements is weak.

Figure 3: Performance comparison on TPC-DS join queries

Aggregation Queries

Aggregation queries compute statistics over the data, exercising functions such as SUM, AVG, and COUNT. ByConity again leads, followed by Doris and Presto. ClickHouse timed out four times; to make the differences easier to see, we clip the chart at 250 seconds.

Figure 4: Performance comparison on TPC-DS aggregation queries

Subqueries

Subqueries are nested inside a SQL statement, usually as conditions or constraints on the main query. As Figure 5 shows, ByConity performs best, because its rule-based optimization (RBO) rewrites complex nested queries as a whole: through operator pushdown, column pruning, partition pruning, and similar techniques, it eliminates all subqueries and converts common patterns into a Join + Agg form (sketched after Figure 5). Doris and Presto come next: Presto timed out on SQL68 and SQL73, Doris timed out on three queries, and ClickHouse again had timeouts and system errors, for the reasons given above. The chart is again clipped at 250 seconds for readability.

Figure 5: Performance comparison on TPC-DS subqueries
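A schematic example of this rewrite, on an illustrative schema rather than an actual TPC-DS query: a correlated subquery is dissolved into a Join + Agg:

```sql
-- As written: a correlated subquery evaluated per outer row
SELECT s.item_id, s.price
FROM sales AS s
WHERE s.price > (SELECT avg(price) FROM sales WHERE item_id = s.item_id);

-- The shape after the optimizer removes the subquery: Join + Agg
SELECT s.item_id, s.price
FROM sales AS s
JOIN
(
    SELECT item_id, avg(price) AS avg_price
    FROM sales
    GROUP BY item_id
) AS a ON s.item_id = a.item_id
WHERE s.price > a.avg_price;
```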

Window-Function Queries

Window-function queries are an advanced SQL scenario that ranks, groups, and orders rows within the result set (a sketch follows Figure 6). As Figure 6 shows, ByConity performs best, followed by Presto; Doris timed out once, and ClickHouse again failed to complete part of the TPC-DS suite.

Figure 6: Performance comparison on TPC-DS window-function queries
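A typical query of this kind, in the spirit of the TPC-DS window queries (column names follow the TPC-DS schema), ranks brands by revenue within each category:

```sql
SELECT i_category,
       i_brand,
       sum(ss_sales_price) AS revenue,
       rank() OVER (PARTITION BY i_category
                    ORDER BY sum(ss_sales_price) DESC) AS revenue_rank
FROM store_sales
JOIN item ON ss_item_sk = i_item_sk
GROUP BY i_category, i_brand;
```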

Summary

This post analyzed and compared the performance of ClickHouse, Doris, Presto, and ByConity on the 99 queries of the TPC-DS benchmark, and found that the four engines perform differently across query scenarios. ByConity excelled in all 99 TPC-DS query scenarios, beating the other three engines; Presto and Doris did well on join, aggregation, and window-function queries; and ClickHouse, whose design and implementation are not specifically optimized for joins, falls short on multi-table join queries overall.

Keep in mind that benchmark results depend on many factors, including data structure, query type, and data model. In practice, you should weigh all of these together with other considerations such as scalability, usability, and stability, choose the OLAP engine that best fits your business needs, and configure and tune it appropriately to get the best performance. All four are excellent OLAP engines with different strengths and suitable scenarios; testing them on representative query scenarios and datasets will give a more complete picture of their performance.

Join Us

The ByConity community has a large user base and is a very open community. We invite everyone to discuss and build with us: we have opened an issue on GitHub at https://github.com/ByConity/ByConity/issues/26, and you can also join our Feishu group, Slack, or Discord to take part in the conversation.

· 12 min read
Vini Jaiswal

ByConity is an open-source cloud-native data warehouse developed by ByteDance. It is built on a computing-storage separation architecture and offers essential features including elastic scalability, tenant resource isolation, and strong consistency in data reading and writing. By leveraging optimizations from popular OLAP engines like column storage, vectorized execution, MPP execution, and query optimization, ByConity delivers exceptional read and write performance.

History of ByConity

The origins of ByConity can be traced back to 2018 when ByteDance initially implemented ClickHouse for internal use. As the business grew, the data volume increased significantly to cater to a large user base. However, ClickHouse's Shared-Nothing architecture, where each node operates independently without sharing storage resources, posed certain challenges during its usage. Here are some of the issues encountered:

Expansion and contraction:

Due to the tight coupling of computing and storage resources in ClickHouse, scaling the system incurred higher costs and involved data migration. This prevented real-time on-demand scalability, resulting in inefficient resource utilization.

Multi-tenancy and shared cluster environment:

ClickHouse's tightly coupled architecture led to interactions among multiple tenants in a shared cluster environment. Since reading and writing operations were performed on the same node, they often interfered with each other, impacting overall performance.

Performance limitations:

ClickHouse's support for complex queries, such as multi-table join operations, was not optimal, which hindered the system's ability to handle such queries efficiently.

To address these pain points, ByteDance undertook an architectural upgrade of ClickHouse. In 2020, we initiated the ByConity project internally. After releasing the Beta version in January 2023, the project was officially made available to the public at the end of May 2023.


Figure 1: ByteDance ClickHouse Usage

Features of ByConity

ByConity implements a computing and storage separation architecture that transforms the original local management of computing and storage on individual nodes. Instead, it adopts a unified management approach for all data across the entire cluster using distributed storage. This transformation results in stateless computing nodes, enabling dynamic expansion and contraction by leveraging the scalability of distributed storage and the stateless nature of computing nodes. ByConity offers several crucial features that enhance its functionality and performance:

Storage-Computing Separation

One of the critical advantages of ByConity is its storage-computing separation architecture, which enables read-write separation and elastic scaling. This architecture ensures that read and write operations do not affect each other, and computing resources and storage resources can be independently expanded and contracted on demand, ensuring efficient resource utilization. ByConity also supports multi-tenant resource isolation, making it suitable for multi-tenant environments.

Figure 2: ByConity storage-computing separation to achieve multi-tenant isolation

Resource Isolation

ByConity provides resource isolation, ensuring that different tenants have separate and independent resources. This prevents interference or impact between tenants, promoting data privacy and efficient multi-tenancy support.

Elastic Scaling

ByConity supports elastic expansion and contraction, allowing for real-time and on-demand scaling of computing resources. This flexibility ensures efficient resource utilization and enables the system to adapt to changing workload requirements.

Strong Data Consistency

ByConity ensures strong consistency of data read and write operations. This ensures that data is always up-to-date and eliminates any inconsistencies between read and write operations, guaranteeing data integrity and accuracy.

High Performance

ByConity incorporates optimization techniques from mainstream OLAP engines, such as column storage, vectorized execution, MPP execution, and query optimization. These optimizations enhance read and write performance, enabling faster and more efficient data processing and analysis.

Technical Architecture of ByConity

ByConity follows a three-layer architecture consisting of:

  1. Service access layer: The service access layer, represented by ByConity Server, handles client data and service access.
  2. Computing layer: The computing layer comprises multiple computing groups, where each Virtual Warehouse (VW) functions as a computing group.
  3. Data storage layer: The data storage layer utilizes distributed file systems like HDFS and S3.

Figure 3: ByConity's architecture

Working Principle of ByConity

ByConity is a powerful open-source cloud-native data warehouse that adopts a storage-computing separation architecture. In this section, we will examine the working principle of ByConity and the interaction process of each component of ByConity through the complete life cycle of a SQL.

Figure 4: ByConity internal component interaction diagram

Figure 4 depicts the interaction diagram of ByConity's components. In the figure, the dotted line represents the inflow of a SQL query, the double-headed arrow indicates component interaction, and the one-way arrow represents data processing and output to the client. Let's explore the interaction process of each component in ByConity throughout the complete lifecycle of a SQL query.

ByConity's working principle can be divided into three stages:

Stage 1: Query Request

The client submits a Query request to the server. The server initially performs parsing and subsequently analyzes and optimizes the query through the Query Analyzer and Optimizer to generate an efficient executable plan. To access the required metadata, which is stored in a distributed key-value (KV) store, ByConity leverages FoundationDB and reads the metadata from the Catalog.

Stage 2: Plan Scheduling

ByConity passes the optimized executable plan to the Plan Scheduler component. The scheduler accesses the Resource Manager to obtain available computing resources and determines which nodes to schedule the query tasks for execution.

Stage 3: Query Execution

The Query request is executed on ByConity's Workers. The Workers read data from the underlying Cloud Storage and perform computations by establishing a Pipeline. The server then aggregates the calculation results from multiple Workers and returns them to the client.

Additionally, ByConity includes two main components: Time-stamp Oracle and Daemon Manager. The time-stamp oracle supports transaction processing, while the daemon manager manages and schedules subsequent tasks.

Main Component Library

To better understand the working principle of ByConity, let's take a look at the main components of ByConity:

Metadata Management

ByConity offers a highly available and high-performance metadata read and write service called the Catalog Server. It supports complete transaction semantics (ACID). Furthermore, we have designed the Catalog Server with a flexible architecture, allowing for the pluggability of backend storage systems. Currently, we support Apple's open-source FoundationDB, and there is potential for extending support to other backend storage systems in the future.

Query Optimizer

The query optimizer plays a crucial role in the performance of a database system. A well-designed optimizer can significantly enhance query performance, particularly in complex query scenarios, where it can achieve performance improvements ranging from several times to hundreds of times. ByConity's self-developed optimizer focuses on improving optimization capabilities through two main approaches:

  • RBO (Rule-Based Optimization): This capability encompasses various optimizations such as column pruning, partition pruning, expression simplification, subquery dissociation, predicate pushdown, redundant operator elimination, Outer-Join to Inner-Join conversion, operator pushdown to storage, distributed operator splitting, and other heuristic optimization techniques (predicate pushdown is sketched after this list).
  • CBO (Cost-Based Optimization): ByConity's optimizer also includes cost-based optimization capabilities. This includes support for join reorder, outer-join reorder, join/agg reorder, common table expressions (CTE), materialized views, dynamic filter push-down, magic set optimization, and other cost-based techniques. Additionally, it integrates property enforcement for distributed planning.
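As a schematic example of one of these rules, on hypothetical tables: predicate pushdown moves a filter below the join so that far fewer rows are joined:

```sql
-- As written: the filter sits above the join
SELECT u.name, o.amount
FROM orders AS o
JOIN users AS u ON o.user_id = u.id
WHERE o.order_date = '2023-05-01';

-- After predicate pushdown (logically): the filter runs at the scan,
-- so only one day's orders ever reach the join
SELECT u.name, o.amount
FROM
(
    SELECT user_id, amount
    FROM orders
    WHERE order_date = '2023-05-01'
) AS o
JOIN users AS u ON o.user_id = u.id;
```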

Query Scheduling

ByConity currently supports two query scheduling strategies: Cache-aware scheduling and Resource-aware scheduling.

  • The cache-aware scheduling focuses on scenarios where computing and storage are separated. Its objective is to maximize cache utilization and minimize cold reads. This strategy aims to schedule tasks to nodes that have corresponding data caches, enabling computations to leverage the cache and improve read and write performance. Additionally, considering the dynamic expansion and contraction of the system, cache-aware scheduling strives to minimize the impact of cache failure on query performance when the computing group's topology changes.
  • Resource-aware scheduling analyzes the resource usage of different nodes within the computing group across the entire cluster. It performs targeted scheduling to optimize resource utilization. Moreover, resource-aware scheduling incorporates flow control mechanisms to ensure rational resource utilization and prevent negative effects caused by overload, such as system downtime.

Computing Group

ByConity enables different tenants to utilize distinct computing resources, as depicted in Figure 5. With ByConity's architecture, implementing features like multi-tenant isolation and read-write separation becomes straightforward. Each tenant can leverage separate computing groups to achieve multi-tenant isolation and support read-write separation. The computing groups can be dynamically expanded and contracted on-demand, ensuring efficient resource utilization. During periods of low resource utilization, resource sharing can be employed, allowing computing groups to be allocated to other tenants to maximize resource utilization and minimize costs.

Virtual File System

The virtual file system module serves as an intermediary layer for data reading and writing. ByConity has optimized this module to provide a "storage as a service" capability to other modules. The virtual file system offers a unified file system abstraction, shielding the underlying different back-end implementations. It facilitates easy expansion and supports multiple storage systems, such as HDFS or object storage.

Cache Acceleration

ByConity utilizes caching to accelerate query processing. Under the computing-storage separation architecture, cache acceleration is performed in both the metadata and data dimensions. In the metadata dimension, ByConity caches Table and Partition information in the memory of the server-side (ByConity Server). In the data dimension, cache acceleration occurs on the Worker side within the computing group. This hierarchical caching mechanism utilizes both memory and disk, with Mark collection serving as the cache granularity. These caching strategies effectively enhance query speed and performance.

How to Deploy ByConity

ByConity currently supports four acquisition and deployment modes; community developers are welcome to try them and submit issues to us.

ByConity's Future Open-Source Plan

ByConity includes several key milestones in its open-source community roadmap through 2023. These milestones are designed to enhance ByConity's functionality, performance, and ease of use. Among them, the development of new storage engines, support for more data types, and integration with other data management tools are some important areas of focus. We have listed the following directions, and created an issue on Github: https://github.com/ByConity/ByConity/issues/26, inviting the community to join us to discuss co-development:

  • Performance improvement: ByConity aims to boost performance through various optimizations. This includes leveraging indexes for acceleration, such as optimizing the existing skip index and supporting new Z-order and inverted indexes. ByConity will also focus on the construction and acceleration of external indexes, as well as automatic index recommendation and conversion. Continuous enhancements to the query optimizer and the implementation of a distributed caching mechanism are also part of the performance improvement efforts.
  • Stability improvement: There are two aspects here.
    • One is to support resource isolation in more dimensions. ByConity is committed to improving stability by extending resource isolation capabilities in multiple dimensions, thereby providing better multi-tenant support.
    • The second direction is to enrich metrics and improve observability and problem diagnosis capabilities, ensuring a stable and reliable experience for users.
  • Enterprise-level feature enhancements: ByConity aims to introduce finer-grained authority control, improve data security-related functions such as backup, recovery, and data encryption, and continue to explore techniques for deep compression of data to save storage costs.
  • Ecosystem compatibility improvement: ByConity plans to expand its compatibility with various storage systems, including popular object storage solutions like S3 and TOS. It plans to enhance the overall compatibility and integration capabilities, facilitating seamless integration with other tools and frameworks. Moreover, it aims to support data lake federation queries, enabling interaction with technologies like Hudi, Iceberg, and more.

Working with the Community

Since the release of the Beta version, ByConity has received support from numerous enterprise developers, including Huawei, Electronic Cloud, Zhanxinzhanli, Tianyi Cloud, Vipshop, and Transsion Holdings. These organizations have actively contributed by deploying ByConity in their respective environments, undergoing TPC-DS verification, and conducting tests in their business scenarios. The results have been promising, and their feedback has provided valuable insights for improvement, which we greatly appreciate.

We are delighted to receive the ideas and willingness of community partners to build together. We have already initiated joint development efforts with Huawei Terminal Cloud. Our collaborative endeavors will focus on various areas, such as Kerberos authentication, ORC support, and integration with S3 storage.

If you are interested in joining our community and participating in the development of ByConity, we invite you to visit our GitHub repository at https://github.com/ByConity/ByConity. You can find more information and details about our ongoing projects, contribute your ideas, and collaborate with us to further enhance ByConity. To get involved, simply scan the QR code provided below to join our Discord or follow us on Twitter.

ByConity Discord Group

ByConity Twitter

Summary

In summary, ByConity is an open source cloud-native data warehouse that offers features such as read-write separation, elastic expansion and contraction, tenant resource isolation, and strong data read and write consistency. It utilizes a computing-storage separation architecture and leverages optimizations from mainstream OLAP engines to deliver excellent read and write performance. As ByConity continues to evolve and improve, it aims to become a key tool for cloud-native data warehousing in the future.

· 10 min read
Zhaojie Niu
Yunbo Wang

Introduction to ByConity

ByConity is an open-source cloud-native data warehouse that adopts a storage-computing separation architecture. It supports several key features, including elastic expansion and contraction, isolation of tenant resources, and strong consistency of data read and write. By utilizing mainstream OLAP engine optimizations, such as column storage, vectorized execution, MPP execution, and query optimization, ByConity provides excellent read and write performance.

ByConity's History

The background of ByConity can be traced back to 2018, when ByteDance began to use ClickHouse internally. As the business developed, the data scale grew larger and larger to serve a large number of users. However, ClickHouse has a Shared-Nothing architecture in which each node is independent and does not share storage resources, so computing and storage resources are tightly coupled. This makes expansion and contraction more costly and involves data migration, which prevents real-time, on-demand scaling and results in wasted resources. Furthermore, the tightly coupled architecture causes multiple tenants to interfere with each other in a shared cluster, and because reads and writes are completed on the same node, they affect each other. Finally, ClickHouse performs poorly on complex queries such as multi-table joins. Based on these pain points, the ByConity project was launched in January 2020.

The ByConity team hopes to give the project back to the community and improve it through the power of open source. In January 2023, ByConity was officially open-sourced, and the beta version was released.


Figure 1: ByteDance ClickHouse Usage

Features of ByConity

ByConity has several key features that make it a powerful open-source cloud-native data warehouse.

Storage-Computing Separation

One of the critical advantages of ByConity is its storage-computing separation architecture, which enables read-write separation and elastic scaling. This architecture ensures that read and write operations do not affect each other, and computing resources and storage resources can be independently expanded and contracted on demand, ensuring efficient resource utilization. ByConity also supports multi-tenant resource isolation, making it suitable for multi-tenant environments.

Figure 2: ByConity storage-computing separation to achieve multi-tenant isolation

Elastic Scaling

ByConity supports flexible expansion and contraction, enabling real-time and on-demand expansion and contraction of computing resources, ensuring efficient use of resources.

Resource Isolation

ByConity isolates the resources of different tenants, ensuring that tenants are not affected by each other.

Strong Data Consistency

ByConity ensures strong consistency of data read and write, ensuring that data is always up to date with no inconsistencies between reads and writes.

High Performance

ByConity adopts mainstream OLAP engine optimizations, such as column storage, vectorized execution, MPP execution, query optimization, etc., ensuring excellent read and write performance.

ByConity's Technical Architecture

ByConity's architecture is divided into three layers:

  1. Service access layer: Responsible for client data and service access, i.e., ByConity Server
  2. Computing group: ByConity's computing resource layer, where each Virtual Warehouse is a computing group
  3. Data storage: Distributed file system, such as HDFS, S3, etc.

Figure 3: ByConity's architecture

Working Principle of ByConity

ByConity is a powerful open-source cloud-native data warehouse that adopts a storage-computing separation architecture. In this section, we will examine the working principle of ByConity and the interaction process of each component of ByConity through the complete life cycle of a SQL.

Figure 4: ByConity internal component interaction diagram

Figure 4 above is the interaction diagram of ByConity's components. The dotted line in the figure indicates the inflow of a SQL query, the solid two-way arrows indicate interaction between components, and the one-way arrows indicate data processing and output to the client.

ByConity's working principle can be divided into three stages:

Stage 1: Query Request

The client submits a query request to the server. The server first performs parsing, then analyzes and optimizes the query through the analyzer and optimizer to generate a more efficient executable plan. At this point the metadata, which is stored in a distributed KV store, is read; ByConity uses FoundationDB and reads the metadata through the Catalog.

Stage 2: Plan Scheduling

ByConity submits the executable plan generated by the analysis and optimizer to the scheduler (Plan Scheduler). The scheduler obtains idle computing resources by accessing the resource manager and decides which nodes to schedule query tasks for execution.

Stage 3: Query Execution

Query requests are finally executed on ByConity's Worker, and the Worker will read data from the lowest-level Cloud storage and perform calculations by establishing a pipeline. Finally, the calculation results of multiple workers are aggregated by the server and returned to the client.

In addition to the above components, ByConity has two more main components: the Time-stamp Oracle and the Daemon Manager. The former supports transaction processing; the latter manages and schedules background tasks.

Main Component Library

To better understand the working principle of ByConity, let's take a look at the main components of ByConity:

Metadata Management

ByConity provides a highly available and high-performance metadata read and write service, the Catalog Server, which supports complete transaction semantics (ACID). We have also given the Catalog Server a clean abstraction that makes the back-end storage system pluggable: currently we support Apple's open-source FoundationDB, and support for more back-end storage systems can be added later.

Query Optimizer

The query optimizer is one of the cores of the database system. A good optimizer can greatly improve query performance. ByConity's self-developed optimizer improves optimization capabilities based on two directions:

  • RBO: Rule-Based Optimization capability. Support: column pruning, partition pruning, expression simplification, subquery disassociation, predicate pushdown, redundant operator elimination, outer-JOIN to INNER-JOIN, operator pushdown storage, distributed operator splitting, etc.
  • CBO: Cost-Based Optimization capability. Support: Join Reorder, Outer-Join Reorder, Join/Agg Reorder, CTE, Materialized View, Dynamic Filter Push-Down, Magic Set, and other cost-based optimization capabilities. And integrate Property Enforcement for distributed planning.

Query Scheduling

ByConity currently supports two query scheduling strategies: Cache-aware scheduling and Resource-aware scheduling.

  • The cache-aware scheduling policy targets scenarios where storage and computing are separated, and aims to maximize use of the cache and avoid cold reads. It tries to schedule tasks to the nodes that hold the corresponding data caches, so that computation hits the cache and read and write performance improves.
  • Resource-aware scheduling perceives the resource usage of different nodes in the computing group in the entire cluster and performs targeted scheduling to maximize resource utilization. At the same time, it also performs flow control to ensure reasonable use of resources and avoid negative effects caused by overload, such as system downtime.

Computing Group

ByConity supports different tenants to use different computing resources. Under ByConity's new architecture, it is easy to implement features such as multi-tenant isolation and read-write separation. Different tenants can use different computing groups to achieve multi-tenant isolation and support read-write separation. Due to the convenient expansion and contraction, the computing group can be dynamically expanded and contracted on demand to ensure efficient resource utilization. When resource utilization is not high, resource sharing can be carried out, and computing groups can be seconded to other tenants to maximize resource utilization and reduce costs.

Virtual File System

The virtual file system module serves as the middle layer for data reading and writing. ByConity wraps it cleanly, exposing storage to other modules as a service ("storage as a service"). The virtual file system provides a unified file system abstraction, shields the different back-end implementations, facilitates expansion, and supports multiple storage systems, such as HDFS or object storage.

Cache Acceleration

ByConity accelerates queries through caching. Under the architecture of separating storage and computing, caching happens in both the metadata and data dimensions. In the metadata dimension, table and partition information is cached in the memory of the ByConity server. In the data dimension, caching happens on the ByConity worker side, i.e., in the computing group; the worker-side cache is hierarchical, using both memory and disk, with the mark set as the cache granularity. Together these caches effectively improve query speed.

How to Obtain and Deploy

ByConity currently supports four acquisition and deployment modes; community developers are welcome to try them and submit issues to us.

ByConity's Future Open-Source Plan

ByConity includes several key milestones in its open-source community roadmap through 2023. These milestones are designed to enhance ByConity's functionality, performance, and ease of use. Among them, the development of new storage engines, support for more data types, and integration with other data management tools are some important areas of focus. We have listed the following directions and created an issue on Github: https://github.com/ByConity/ByConity/issues/26, inviting community partners to join us and discuss building them together:

  • Performance improvement: ByConity hopes to keep improving performance along three technical directions. The first is index-based acceleration, which includes four aspects:
    • optimizing the existing skip index;
    • exploring new index types, such as the Z-order index and inverted indexes;
    • building and accelerating indexes over Hive tables;
    • recommending and converting indexes automatically, lowering the barrier for users.
    The second direction is continuous optimization of the query optimizer. The third is caching: ByConity's cache mechanism is currently local, and each computing group can only access its own cache; in the future, we hope to implement a distributed cache mechanism to further improve the cache hit rate.
  • Stability improvement: There are two aspects here. One is to support resource isolation in more dimensions. Currently, resource isolation is only supported at the computing-group level; next, it will also be supported on the server side, providing better end-to-end multi-tenancy guarantees. The second direction is to enrich metrics and improve observability and problem-diagnosis capabilities.
  • Enterprise-level feature enhancements: We hope to achieve finer-grained permission control, including column-level permissions. We also want to improve data-security functions such as data backup and recovery and end-to-end data encryption. Finally, we continue to explore deep data compression to save storage costs.
  • Ecological compatibility improvement: This is the most important direction. ByConity plans to support more types of storage backends, such as AWS S3 and Volcano Engine's object storage. Improving ecological compatibility also includes integration with various drivers and open-source software. At the same time, we hope to support federated queries over data lakes, such as Hudi and Iceberg.

In short, ByConity is an open source cloud-native data warehouse that provides read-write separation, elastic expansion and contraction, tenant resource isolation, and strong consistency of data read and write. Its storage-computing separation architecture, combined with mainstream OLAP engine optimization, ensures excellent read and write performance. As ByConity continues to develop and improve, it is expected to become an important tool for cloud-native data warehouses in the future.

We have a video that introduces ByConity in detail, including a demo of ByConity. If you need more information, you can check the following link: https://www.bilibili.com/video/BV15k4y1b7pw/?spm_id_from=333.999.0.0&vd_source=71f3be2102fec1a0171b49a530cefad0

Scan the QR code below and reply with [name + QR code] to join the ByConity communication group for more project updates and event information.

ByConity Community QR Code