Version: 1.0.x

Monitoring Cluster

Common Monitoring Metrics

Prometheus Metrics:

By default, the engine exposes its monitoring metrics over HTTP at the /metrics endpoint on port 8123, where they can be accessed directly.

To view the metric output of a server pod, forward its port with kubectl:

# Forward local port 8123 to port 8123 on the server pod
kubectl port-forward -n cnch cnch-default-server-0 8123:8123

Afterward, open localhost:8123/metrics in a web browser to see the metrics, as shown in the figure below. Each line corresponds to one metric item and follows the Prometheus metric format.
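
The same output can also be inspected from a terminal. The following is a minimal sketch, assuming curl and awk are installed; filter_metrics is a hypothetical helper name, not part of the engine. It drops the comment lines (those starting with #) and keeps only the samples whose metric name starts with a given prefix.

```shell
# Hypothetical helper: read Prometheus text output on stdin and print only
# the sample lines whose name starts with the prefix given as $1.
filter_metrics() {
  # Skip "#" comment lines (HELP/TYPE), keep lines that begin with the prefix.
  awk -v p="$1" '!/^#/ && index($0, p) == 1'
}

# Usage, with the port-forward from above still running in another terminal:
#   curl -s http://localhost:8123/metrics | filter_metrics cnch_current_metrics_query
```

Matching on a literal prefix with index() avoids treating characters in the metric name as a regular expression.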

VictoriaMetrics Metrics Aggregation:

VictoriaMetrics is used for metric storage because it scales horizontally and provides richer functionality. One important feature is VMRule, which aggregates raw metrics. Some of the raw Prometheus metrics emitted by the components can be used directly to build monitoring alerts, while others are more complex and not suitable for building dashboards or alerts directly. These are aggregated using VMRule.

Refer to the file cnch-metrics.yaml for specific configuration rules, and apply the configuration using kubectl:

kubectl apply -f cnch-metrics.yaml # Apply the corresponding rules
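
For orientation, a VMRule recording rule generally has the shape sketched below. This is an illustrative assumption, not the contents of cnch-metrics.yaml: the resource name, group name, recorded series name, and the namespace and pod labels in the expression are all hypothetical. Only the raw metric cnch_current_metrics_query is taken from this page.

```shell
# Write a hypothetical VMRule manifest to a temporary file.
rule_file="${TMPDIR:-/tmp}/example-vmrule.yaml"
cat > "$rule_file" <<'EOF'
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: example-cnch-rules          # hypothetical name
  namespace: cnch
spec:
  groups:
    - name: example.cnch.rules      # hypothetical group
      interval: 30s
      rules:
        # Recording rule: store an aggregated series under a new name.
        - record: example:current_metrics_query:sum
          expr: sum by (namespace, pod) (cnch_current_metrics_query)
EOF

# To load it into the cluster, apply it the same way as cnch-metrics.yaml:
#   kubectl apply -f "$rule_file"
```

Recorded series names conventionally contain colons, which is why the aggregated metrics on this page look different from the raw ones.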

Monitoring Service Nodes (Server)

Key Metrics

Metric Name (quoted ones are aggregated by VictoriaMetrics)	Description
`cnch:latency:queries_cluster:pct95`, `cnch:latency:queries_cluster:pct99`	Query latency at the 95th and 99th percentiles
`cnch:latency:queries_cluster:slow_ratio`	Percentage of queries that take longer than 10s
`cnch:profile_events:labelled_query_vw:total_rate5m`	Total QPS for all VWs
`cnch:profile_events:labelled_query_vw:error_rate5m`	Failed QPS for all VWs
cnch_current_metrics_query	Queries currently running; write queries are labeled with the query type 'insert'

Configuring Grafana Dashboard for Server

Dashboard highlights:

Dashboard Name	Expression	Description
Queries Durations	`cnch:latency:queries_cluster:pct95{namespace="$namespace", cluster="$cluster"}` and `cnch:latency:queries_cluster:pct99{namespace="$namespace", cluster="$cluster"}`	Query latency at the 95th and 99th percentiles
Slow Queries > 10s	`cnch:latency:queries_cluster:slow_ratio{namespace="$namespace", cluster="$cluster"}`	Percentage of queries that take longer than 10s
Queries Per Second	`sum(cnch:profile_events:labelled_query_vw:total_rate5m{namespace="$namespace", cluster="$cluster", workload=~"$workload"})`	Total QPS for all VWs
VW Queries Success	`1 - (sum by (pod) (cnch:profile_events:labelled_query_vw:error_rate5m{cluster="$cluster", namespace="$namespace", workload=~"$workload", pod=~"$pod"}) / sum by (pod) (cnch:profile_events:labelled_query_vw:total_rate5m{cluster="$cluster", namespace="$namespace", workload=~"$workload", pod=~"$pod"}))`	Success rate per pod, calculated as 1 minus the ratio of `error_rate5m` to `total_rate5m`
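
The arithmetic behind the VW Queries Success expression can be checked by hand. The sketch below is only an illustration with made-up input values; success_ratio is a hypothetical helper, and it assumes awk is available.

```shell
# Success ratio = 1 - error_rate / total_rate, mirroring the panel expression.
success_ratio() {
  # $1 = failed queries per second, $2 = total queries per second
  awk -v err="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "NaN"; exit }   # no traffic: the ratio is undefined
    printf "%.4f\n", 1 - err / total
  }'
}

# Example with made-up rates: 0.5 failed out of 25 queries per second.
success_ratio 0.5 25   # prints 0.9800
```

A pod with no traffic has no meaningful success rate, so the helper reports NaN for it rather than 1.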

The complete Grafana dashboard configuration for the Server, cnch-server.json, can be imported through the Grafana UI.

Monitoring TSO

Key Metrics

The following are important metrics specific to TSO:

Metric Name	Description
cnch:tso:requests:error_rate5m	Failed QPS for the TSO component
cnch:tso:requests:total_rate5m	Total QPS for the TSO component
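
By analogy with the VW Queries Success expression for the Server, a TSO success ratio could be derived from these two metrics. The sketch below only builds the expression text. It assumes the TSO series carry namespace and cluster labels like the Server metrics do, which this page does not confirm, and tso_success_expr is a hypothetical helper name.

```shell
# Build a PromQL expression for the TSO success ratio (label names are assumed).
tso_success_expr() {
  # $1 = namespace, $2 = cluster
  sel="namespace=\"$1\", cluster=\"$2\""
  printf '1 - (sum(cnch:tso:requests:error_rate5m{%s}) / sum(cnch:tso:requests:total_rate5m{%s}))\n' "$sel" "$sel"
}

# Example with placeholder values; paste the result into a Grafana panel query.
tso_success_expr cnch default
```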
