Version: 0.2.0

Monitor Cluster

Common Monitoring Metrics

Prometheus Monitoring Metrics:

The engine exposes its monitoring metrics on the HTTP endpoint /metrics. The default port is 8123, and you can access the output on that port directly.

The corresponding metric output can be viewed through kubectl:

kubectl port-forward -n cnch cnch-default-server-0 8123:8123
# use port-forward to proxy the metrics port locally

After that, open localhost:8123/metrics in a browser to view the metrics, as shown in the screenshot. Each line corresponds to a specific metric and follows the Prometheus exposition format.
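For a quick check without a browser, the endpoint can also be fetched with curl. The two sample lines below are only illustrative; the exact metric names, labels, and values depend on your cluster:

curl -s localhost:8123/metrics | head
# example output lines (illustrative):
# cnch_histogram_metrics_query_latency_bucket{le="10000"} 42
# cnch_profile_events_tso_request_total 1024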

VictoriaMetrics Metric Aggregation:

We choose VictoriaMetrics for metric storage because it is easy to scale horizontally and provides richer functionality.

An important feature is VMRule, which aggregates raw metrics. Some of the raw Prometheus metrics exposed by each component can be used directly to build monitoring alerts; the rest are more complex and are not easy to turn into dashboards and alerts directly, so they are aggregated through VMRule. The following is the rule configuration file cnch-metrics.yaml:

# Source: victoria-rules/templates/cnch-metrics.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: release-name-victoria-rule-cnch-metrics
  namespace: cnch-operator-default-system
  labels:
    app: victoria-rules
    chart: victoria-rules-0.1.6
    release: "release-name"
    heritage: "Helm"
spec:
  groups:
  - name: CnchMetricsLatency
    rules:
    # Histogram at VW level
    - record: cnch:latency:queries_vw:pct95
      expr: |-
        histogram_quantile(0.95,
          sum by (cluster, namespace, vw, le)(
            rate(cnch_histogram_metrics_query_latency_bucket[5m])
          )
        )
    # Histogram at Cluster level
    - record: cnch:latency:queries_cluster:pct95
      expr: |-
        histogram_quantile(0.95,
          sum by (cluster, namespace, le)(
            rate(cnch_histogram_metrics_query_latency_bucket[5m])
          )
        )

    # Trends Metrics
    # Trend Latency VW level
    - record: cnch:latency:queries_vw:pct95:avg_1d
      expr: avg_over_time(cnch:latency:queries_vw:pct95[1d])

    # Trend Latency Cluster level
    - record: cnch:latency:queries_cluster:pct95:avg_1d
      expr: avg_over_time(cnch:latency:queries_cluster:pct95[1d])
    # Histogram at VW level
    - record: cnch:latency:queries_vw:pct99
      expr: |-
        histogram_quantile(0.99,
          sum by (cluster, namespace, vw, le)(
            rate(cnch_histogram_metrics_query_latency_bucket[5m])
          )
        )
    # Histogram at Cluster level
    - record: cnch:latency:queries_cluster:pct99
      expr: |-
        histogram_quantile(0.99,
          sum by (cluster, namespace, le)(
            rate(cnch_histogram_metrics_query_latency_bucket[5m])
          )
        )

    # Trends Metrics
    # Trend Latency VW level
    - record: cnch:latency:queries_vw:pct99:avg_1d
      expr: avg_over_time(cnch:latency:queries_vw:pct99[1d])

    # Trend Latency Cluster level
    - record: cnch:latency:queries_cluster:pct99:avg_1d
      expr: avg_over_time(cnch:latency:queries_cluster:pct99[1d])

    # Trend Slow Q VW level
    - record: cnch:latency:queries_vw:slow_ratio:avg_1d
      expr: avg_over_time(cnch:latency:queries_vw:slow_ratio[1d])

    # Trend Slow Q Cluster level
    - record: cnch:latency:queries_cluster:slow_ratio:avg_1d
      expr: avg_over_time(cnch:latency:queries_cluster:slow_ratio[1d])

    # Slow Q VW level (Percentage of query > 10s)
    - record: cnch:latency:queries_vw:slow_ratio
      expr: |-
        sum by (cluster, namespace, vw)(
          rate(cnch_histogram_metrics_query_latency_count[5m])
          - on (namespace, pod, cluster, vw, instance) rate(cnch_histogram_metrics_query_latency_bucket{le="10000"}[5m])
        )
        /
        sum by (cluster, namespace, vw)(
          rate(cnch_histogram_metrics_query_latency_count[5m])
        )

    # Slow Q Cluster level (Percentage of query > 10s)
    - record: cnch:latency:queries_cluster:slow_ratio
      expr: |-
        sum by (cluster, namespace)(
          rate(cnch_histogram_metrics_query_latency_count[5m])
          - on (namespace, pod, cluster, vw, instance) rate(cnch_histogram_metrics_query_latency_bucket{le="10000"}[5m])
        )
        /
        sum by (cluster, namespace)(
          rate(cnch_histogram_metrics_query_latency_count[5m])
        )

    # Slow Q Cluster level (count queries > 10s) used by OP portal
    - record: cnch:latency:queries_cluster:slow_count
      expr: |-
        sum by (cluster, namespace)(
          increase(cnch_histogram_metrics_query_latency_count[1h])
          - on (namespace, pod, cluster, vw, instance) increase(cnch_histogram_metrics_query_latency_bucket{le="10000"}[1h])
        )

    # Todo check if this metric became server only
    - record: cnch:latency:queries_timeout:rate5m
      expr: |-
        sum by (cluster, namespace, pod, workload) (
          rate(cnch_profile_events_timed_out_query_total[5m])
          * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel
        )

  - name: CnchMetricsQPS
    rules:
    # Trend WG workload level
    # - record: cnch:profile_events:query:total_rate5m:avg_1d
    #   expr: avg_over_time(sum by (cluster, namespace, workload, type) (cnch:profile_events:query:total_rate5m)[1d])

    # Trend VW QPS VW level. server POV only
    - record: cnch:profile_events:labelled_query_vw:total_rate5m:avg_1d
      expr: avg_over_time(sum by (cluster, namespace, vw, query_type) (cnch:profile_events:labelled_query_vw:total_rate5m)[1d])

    # VW QPS cluster level Todo use sum(avg_1d{vw != ""}) if no similar reenable this Trend
    - record: cnch:profile_events:labelled_query_cluster:total_rate5m:avg_1d
      expr: |-
        avg_over_time(sum by (cluster, namespace, query_type) (cnch:profile_events:labelled_query_vw:total_rate5m)[1d])

    # Trend VW Error Ratio VW level (can't sum burnrate % so we pre-recorded a burnrate summed at vw level)
    - record: cnch:profile_events:labelled_query_vw_sum:error_burnrate5m:avg_1d
      expr: |-
        avg_over_time(cnch:profile_events:labelled_query_vw_sum:error_burnrate5m[1d])

    # Number of workers in a WG that use more than 80% memory
    - record: cnch:workers:high_mem_rss:80pct_count
      expr: |-
        (
          count(
            (sum(
              container_memory_rss{container!="", image!=""}
              * on(namespace,pod)
              group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{workload=~"cnch.*worker.*|vw.*"}
            ) by (pod, namespace, workload)
            / sum(
              kube_pod_container_resource_limits{resource="memory"}
              * on(namespace,pod)
              group_left(workload, workload_type) namespace_workload_pod:kube_pod_owner:relabel{workload=~"cnch.*worker.*|vw.*"}
            ) by (pod, namespace, workload)) > 0.80
          ) by (namespace, workload)
          /
          count(namespace_workload_pod:kube_pod_owner:relabel{workload=~"cnch.*worker.*|vw.*"}) by (namespace, workload)
        )

    # Byteyard Usage Profiler metrics
    - record: cnch:vw:metrics:running_queries:time_milliseconds_total
      expr: sum by (vw_id, cluster) (increase(cnch_internal_metrics_running_queries_time_milliseconds_total[30s]))
    - record: cnch:vw:metrics:queued_queries:time_milliseconds_total
      expr: sum by (vw_id, cluster) (increase(cnch_internal_metrics_queued_queries_time_milliseconds_total[30s]))

    # Query Error Ratio over multiple intervals aka burn rate

    # Record server POV, for vw only, the unlimited are only used for few dashboard and 1 alert rule
    # Worker POV is used in workers dashboard only
    - record: cnch:profile_events:labelled_query_vw:total_rate5m
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[5m])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}

    # Record workers POV, used by Byteyard autosuspend (server pov might not have direct insert) and workers graph
    - record: cnch:profile_events:labelled_query_vw_workers:total_rate5m
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[5m])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload!~".*server.*"}
    # TEMP until byteyard support cnch:profile_events:labelled_query_vw_workers
    # TODO remove this
    - record: cnch:profile_event:queries_vw_only:total_rate5m
      expr: |-
        cnch:profile_events:labelled_query_vw_workers:total_rate5m

    - record: cnch:tso:requests:total_rate5m
      expr: |-
        sum(rate(cnch_profile_events_tso_request_total[5m])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Err/s default to 0 if a request total exist (e.g. only success request) so it's included in availability
    - record: cnch:profile_events:labelled_query_vw:error_rate5m
      expr: |
        ((
          sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="vw"}[5m])) by (pod, cluster, namespace, query_type, vw, wg)
        )
        or
        (
          0 * group by (pod, cluster, namespace, resource_type, query_type, vw, wg) (cnch:profile_events:labelled_query_vw:total_rate5m)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}
    - record: cnch:tso:requests:error_rate5m
      expr: |
        ((
          sum(rate(cnch_profile_events_tso_error_total[5m])) by (pod, cluster, namespace)
        )
        or
        (
          0 * group by (pod, cluster, namespace) (cnch:tso:requests:total_rate5m)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Use WG level precision
    - record: cnch:profile_events:labelled_query_vw:error_burnrate5m
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate5m) by (workload, cluster, namespace, vw, wg)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate5m) by (workload, cluster, namespace, vw, wg)

    - record: cnch:tso:requests:error_burnrate5m
      expr: |
        sum(cnch:tso:requests:error_rate5m) by (workload, cluster, namespace)
        /
        sum(cnch:tso:requests:total_rate5m) by (workload, cluster, namespace)

    # Record server POV, for vw only, the unlimited are only used for few dashboard and 1 alert rule
    # Worker POV is used in workers dashboard only
    - record: cnch:profile_events:labelled_query_vw:total_rate30m
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[30m])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}

    # Record workers POV, used by Byteyard autosuspend (server pov might not have direct insert) and workers graph
    - record: cnch:profile_events:labelled_query_vw_workers:total_rate30m
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[30m])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload!~".*server.*"}
    # TEMP until byteyard support cnch:profile_events:labelled_query_vw_workers
    # TODO remove this
    - record: cnch:profile_event:queries_vw_only:total_rate30m
      expr: |-
        cnch:profile_events:labelled_query_vw_workers:total_rate30m

    - record: cnch:tso:requests:total_rate30m
      expr: |-
        sum(rate(cnch_profile_events_tso_request_total[30m])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Err/s default to 0 if a request total exist (e.g. only success request) so it's included in availability
    - record: cnch:profile_events:labelled_query_vw:error_rate30m
      expr: |
        ((
          sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="vw"}[30m])) by (pod, cluster, namespace, query_type, vw, wg)
        )
        or
        (
          0 * group by (pod, cluster, namespace, resource_type, query_type, vw, wg) (cnch:profile_events:labelled_query_vw:total_rate30m)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}
    - record: cnch:tso:requests:error_rate30m
      expr: |
        ((
          sum(rate(cnch_profile_events_tso_error_total[30m])) by (pod, cluster, namespace)
        )
        or
        (
          0 * group by (pod, cluster, namespace) (cnch:tso:requests:total_rate30m)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Use WG level precision
    - record: cnch:profile_events:labelled_query_vw:error_burnrate30m
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate30m) by (workload, cluster, namespace, vw, wg)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate30m) by (workload, cluster, namespace, vw, wg)

    - record: cnch:tso:requests:error_burnrate30m
      expr: |
        sum(cnch:tso:requests:error_rate30m) by (workload, cluster, namespace)
        /
        sum(cnch:tso:requests:total_rate30m) by (workload, cluster, namespace)

    # Record server POV, for vw only, the unlimited are only used for few dashboard and 1 alert rule
    # Worker POV is used in workers dashboard only
    - record: cnch:profile_events:labelled_query_vw:total_rate1h
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[1h])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}

    # Record workers POV, used by Byteyard autosuspend (server pov might not have direct insert) and workers graph
    - record: cnch:profile_events:labelled_query_vw_workers:total_rate1h
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[1h])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload!~".*server.*"}
    # TEMP until byteyard support cnch:profile_events:labelled_query_vw_workers
    # TODO remove this
    - record: cnch:profile_event:queries_vw_only:total_rate1h
      expr: |-
        cnch:profile_events:labelled_query_vw_workers:total_rate1h

    - record: cnch:tso:requests:total_rate1h
      expr: |-
        sum(rate(cnch_profile_events_tso_request_total[1h])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Err/s default to 0 if a request total exist (e.g. only success request) so it's included in availability
    - record: cnch:profile_events:labelled_query_vw:error_rate1h
      expr: |
        ((
          sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="vw"}[1h])) by (pod, cluster, namespace, query_type, vw, wg)
        )
        or
        (
          0 * group by (pod, cluster, namespace, resource_type, query_type, vw, wg) (cnch:profile_events:labelled_query_vw:total_rate1h)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}
    - record: cnch:tso:requests:error_rate1h
      expr: |
        ((
          sum(rate(cnch_profile_events_tso_error_total[1h])) by (pod, cluster, namespace)
        )
        or
        (
          0 * group by (pod, cluster, namespace) (cnch:tso:requests:total_rate1h)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Use WG level precision
    - record: cnch:profile_events:labelled_query_vw:error_burnrate1h
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate1h) by (workload, cluster, namespace, vw, wg)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate1h) by (workload, cluster, namespace, vw, wg)

    - record: cnch:tso:requests:error_burnrate1h
      expr: |
        sum(cnch:tso:requests:error_rate1h) by (workload, cluster, namespace)
        /
        sum(cnch:tso:requests:total_rate1h) by (workload, cluster, namespace)

    # Record server POV, for vw only, the unlimited are only used for few dashboard and 1 alert rule
    # Worker POV is used in workers dashboard only
    - record: cnch:profile_events:labelled_query_vw:total_rate6h
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[6h])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}

    # Record workers POV, used by Byteyard autosuspend (server pov might not have direct insert) and workers graph
    - record: cnch:profile_events:labelled_query_vw_workers:total_rate6h
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[6h])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload!~".*server.*"}
    # TEMP until byteyard support cnch:profile_events:labelled_query_vw_workers
    # TODO remove this
    - record: cnch:profile_event:queries_vw_only:total_rate6h
      expr: |-
        cnch:profile_events:labelled_query_vw_workers:total_rate6h

    - record: cnch:tso:requests:total_rate6h
      expr: |-
        sum(rate(cnch_profile_events_tso_request_total[6h])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Err/s default to 0 if a request total exist (e.g. only success request) so it's included in availability
    - record: cnch:profile_events:labelled_query_vw:error_rate6h
      expr: |
        ((
          sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="vw"}[6h])) by (pod, cluster, namespace, query_type, vw, wg)
        )
        or
        (
          0 * group by (pod, cluster, namespace, resource_type, query_type, vw, wg) (cnch:profile_events:labelled_query_vw:total_rate6h)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}
    - record: cnch:tso:requests:error_rate6h
      expr: |
        ((
          sum(rate(cnch_profile_events_tso_error_total[6h])) by (pod, cluster, namespace)
        )
        or
        (
          0 * group by (pod, cluster, namespace) (cnch:tso:requests:total_rate6h)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Use WG level precision
    - record: cnch:profile_events:labelled_query_vw:error_burnrate6h
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate6h) by (workload, cluster, namespace, vw, wg)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate6h) by (workload, cluster, namespace, vw, wg)

    - record: cnch:tso:requests:error_burnrate6h
      expr: |
        sum(cnch:tso:requests:error_rate6h) by (workload, cluster, namespace)
        /
        sum(cnch:tso:requests:total_rate6h) by (workload, cluster, namespace)

    # Record server POV, for vw only, the unlimited are only used for few dashboard and 1 alert rule
    # Worker POV is used in workers dashboard only
    - record: cnch:profile_events:labelled_query_vw:total_rate3d
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[3d])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}

    # Record workers POV, used by Byteyard autosuspend (server pov might not have direct insert) and workers graph
    - record: cnch:profile_events:labelled_query_vw_workers:total_rate3d
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="vw"}[3d])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload!~".*server.*"}
    # TEMP until byteyard support cnch:profile_events:labelled_query_vw_workers
    # TODO remove this
    - record: cnch:profile_event:queries_vw_only:total_rate3d
      expr: |-
        cnch:profile_events:labelled_query_vw_workers:total_rate3d

    - record: cnch:tso:requests:total_rate3d
      expr: |-
        sum(rate(cnch_profile_events_tso_request_total[3d])) by (pod, cluster, namespace)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Err/s default to 0 if a request total exist (e.g. only success request) so it's included in availability
    - record: cnch:profile_events:labelled_query_vw:error_rate3d
      expr: |
        ((
          sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="vw"}[3d])) by (pod, cluster, namespace, query_type, vw, wg)
        )
        or
        (
          0 * group by (pod, cluster, namespace, resource_type, query_type, vw, wg) (cnch:profile_events:labelled_query_vw:total_rate3d)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel{workload=~".*server.*"}
    - record: cnch:tso:requests:error_rate3d
      expr: |
        ((
          sum(rate(cnch_profile_events_tso_error_total[3d])) by (pod, cluster, namespace)
        )
        or
        (
          0 * group by (pod, cluster, namespace) (cnch:tso:requests:total_rate3d)
        ))
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    # Use WG level precision
    - record: cnch:profile_events:labelled_query_vw:error_burnrate3d
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate3d) by (workload, cluster, namespace, vw, wg)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate3d) by (workload, cluster, namespace, vw, wg)

    - record: cnch:tso:requests:error_burnrate3d
      expr: |
        sum(cnch:tso:requests:error_rate3d) by (workload, cluster, namespace)
        /
        sum(cnch:tso:requests:total_rate3d) by (workload, cluster, namespace)

    # Use VW level precision only 5m timeframe used for dashboard only (trend avg_1d)
    - record: cnch:profile_events:labelled_query_vw_sum:error_burnrate5m
      expr: |
        sum(cnch:profile_events:labelled_query_vw:error_rate5m) by (cluster, namespace, vw)
        /
        sum(cnch:profile_events:labelled_query_vw:total_rate5m) by (cluster, namespace, vw)

    # Only used for few dashboard and 1 error alert rule, no need burn rate worker+ server POV
    - record: cnch:profile_events:labelled_query_unlimited:total_rate5m
      expr: |-
        sum(rate(cnch_profile_events_labelled_query_total{resource_type="unlimited"}[5m])) by (pod, cluster, namespace, query_type)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

    - record: cnch:profile_events:labelled_query_unlimited:error_rate5m
      expr: |-
        sum(rate(cnch_profile_events_queries_failed_total{failure_type!="QueriesFailedFromUser", resource_type="unlimited"}[5m])) by (pod, cluster, namespace, query_type, vw, wg)
        * on (pod, namespace) group_left(workload) namespace_workload_pod:kube_pod_owner:relabel

  - name: CnchMetricsAvailability
    rules:
    - record: cnch:wg:availability
      labels:
        slo: error_rate
      # 1 = available, 0 = unavailable
      # min() check if any of the burn rate is firing (1, 1, 0) -> 0
      # For any burn rate, both time window must be triggered (Multiwindow) so we use max() (1, 0) -> 1 avail
      # TODO maybe change this with a ALERTS{alertstate="firing",severity="critical", alertname=~".*BudgetBurn"}
      # As we can't have the 'for 15m' here see: https://github.com/metalmatze/slo-libsonnet/issues/52
      expr: |
        min by (cluster, namespace, vw, wg) (
          max by (cluster, namespace, vw, wg) (
            cnch:profile_events:labelled_query_vw:error_burnrate5m{vw=~".*"} <= bool (14.40 * (1 - 0.99)),
            cnch:profile_events:labelled_query_vw:error_burnrate1h{vw=~".*"} <= bool (14.40 * (1 - 0.99))
          ),
          max by (cluster, namespace, vw, wg) (
            cnch:profile_events:labelled_query_vw:error_burnrate30m{vw=~".*"} <= bool (6.00 * (1 - 0.99)),
            cnch:profile_events:labelled_query_vw:error_burnrate6h{vw=~".*"} <= bool (6.00 * (1 - 0.99))
          ),
          max by (cluster, namespace, vw, wg) (
            cnch:profile_events:labelled_query_vw:error_burnrate6h{vw=~".*"} <= bool (1.00 * (1 - 0.99)),
            cnch:profile_events:labelled_query_vw:error_burnrate3d{vw=~".*"} <= bool (1.00 * (1 - 0.99))
          )
        )
    - record: cnch:cluster:availability
      expr: |
        1 - (sum by (cluster, namespace) (cnch:profile_events:labelled_query_vw:error_rate5m)
        /
        sum by (cluster, namespace) (cnch:profile_events:labelled_query_vw:total_rate5m))

Apply the configuration with kubectl for it to take effect:

kubectl apply -f cnch-metrics.yaml # apply the corresponding rule
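
To confirm that the rule object was created, you can list VMRule resources in the namespace used by the configuration above (this assumes the VictoriaMetrics operator and its VMRule CRD are installed in the cluster):

kubectl get vmrules -n cnch-operator-default-system
# the rule release-name-victoria-rule-cnch-metrics should appear in the output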

Monitor the Service Node (Server)

Important Metrics

The more important metrics are excerpted below with explanations:

Metric name (names with colons are VM-aggregated recording rules) | Description
cnch:latency:queries_cluster:pct95, cnch:latency:queries_cluster:pct99 | Query latency pct95 and pct99
cnch:latency:queries_cluster:slow_ratio | Proportion of slow queries taking longer than 10s
cnch:profile_events:labelled_query_vw:total_rate5m | Total QPS across all VWs
cnch:profile_events:labelled_query_vw:error_rate5m | Failed QPS across all VWs
cnch_current_metrics_query | Queries whose query_type label value is insert are write queries
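
These aggregated metrics can be referenced directly in alerting rules. The sketch below shows one possible rule on the slow-query ratio; the alert name, threshold, and duration are illustrative assumptions, not part of the shipped configuration:

- alert: CnchSlowQueryRatioHigh   # hypothetical alert name
  expr: cnch:latency:queries_cluster:slow_ratio > 0.05   # assumed threshold: more than 5% of queries slower than 10s
  for: 15m
  labels:
    severity: warning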

Configure the Grafana Dashboard for the Service Node (Server)

The dashboard content is shown in the screenshot.

Key panels:

Panel: Queries Durations
Expression: cnch:latency:queries_cluster:pct95{namespace="$namespace", cluster="$cluster"} and cnch:latency:queries_cluster:pct99{namespace="$namespace", cluster="$cluster"}
Description: Query latency P95 and P99

Panel: Slow Queries > 10s
Expression: cnch:latency:queries_cluster:slow_ratio{namespace="$namespace", cluster="$cluster"}
Description: Proportion of slow queries taking longer than 10s

Panel: Queries Per Second
Expression: sum(cnch:profile_events:labelled_query_vw:total_rate5m{namespace="$namespace", cluster="$cluster", workload=~"$workload"})
Description: Total QPS across all VWs

Panel: VW Queries Success
Expression: 1 - (sum by (pod) (cnch:profile_events:labelled_query_vw:error_rate5m{cluster="$cluster", namespace="$namespace", workload=~"$workload", pod=~"$pod"}) / sum by (pod) (cnch:profile_events:labelled_query_vw:total_rate5m{cluster="$cluster", namespace="$namespace", workload=~"$workload", pod=~"$pod"}))
Description: Success rate, computed as one minus the ratio of error_rate5m to total_rate5m

The complete Grafana configuration file for the Server is cnch-server.json; it can be imported in the Grafana UI.

Monitor TSO

Important Metrics

Some important TSO metrics are excerpted below:

Metric name | Description
cnch:tso:requests:error_rate5m | Failed QPS of the TSO component
cnch:tso:requests:total_rate5m | Total QPS of the TSO component
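
The TSO failure ratio is simply the quotient of these two metrics; the recording rule cnch:tso:requests:error_burnrate5m defined in cnch-metrics.yaml above computes exactly this:

sum(cnch:tso:requests:error_rate5m) by (workload, cluster, namespace)
/
sum(cnch:tso:requests:total_rate5m) by (workload, cluster, namespace)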

Configure the Grafana Dashboard for TSO

The dashboard screenshot is as follows:

Key panels:

Panel: TSO Server Requests Per Sec
Expression: cnch:tso:requests:total_rate5m{namespace="$namespace", cluster="$cluster", workload=~".*server.*"}
Description: QPS of TSO requests issued by the Server component

Panel: TSO Worker Requests Per Sec
Expression: cnch:tso:requests:total_rate5m{namespace="$namespace", cluster="$cluster", workload!~".*(server|kafka).*"}
Description: Excludes server and kafka, showing only the TSO request QPS of each worker

Panel: TSO Server Requests Error Rate
Expression: cnch:tso:requests:error_rate5m{namespace="$namespace", cluster="$cluster", workload=~".*server.*"} / cnch:tso:requests:total_rate5m{namespace="$namespace", cluster="$cluster", workload=~".*server.*"}
Description: Dividing error_rate by total_rate gives the failure rate of TSO requests

The complete TSO Grafana configuration file is cnch-tso.json.

Other information that can be monitored

Other commonly used dashboard configurations are listed here; screenshots are omitted.

Cluster Overview: Overview of the entire cluster cnch-cluster.json

VW: details of each Virtual Warehouse cnch-vw.json

DaemonManager: the component that manages background tasks such as Merge cnch-daemonmanager.json