Key Metrics

Learn about the key metrics displayed on the Grafana Overview dashboard.


If your TiKV cluster is deployed using Ansible or Docker Compose, the monitoring system is deployed at the same time. For more details, see Overview of the TiKV Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards, including Overview, PD, TiKV, and so on. You can use the metrics on these dashboards to help diagnose the cluster.

For routine operations, the Overview dashboard displays the key metrics, giving you an overview of the status of each component (PD, TiKV) and of the entire cluster. This document describes these key metrics in detail.

Key metrics description

To understand the key metrics displayed on the Overview dashboard, check the following table:

| Service | Panel Name | Description | Normal Range |
| --- | --- | --- | --- |
| Services Port Status | Services Online | the number of online nodes for each service | |
| Services Port Status | Services Offline | the number of offline nodes for each service | |
| PD | Storage Capacity | the total storage capacity of the TiKV cluster | |
| PD | Current Storage Size | the occupied storage capacity of the TiKV cluster | |
| PD | Number of Regions | the total number of Regions in the current cluster | |
| PD | Leader Balance Ratio | the difference in leader ratio between the node with the highest leader ratio and the node with the lowest leader ratio | Less than 5% when the cluster is balanced. It increases when you restart a node. |
| PD | Region Balance Ratio | the difference in Region ratio between the node with the highest Region ratio and the node with the lowest Region ratio | Less than 5% when the cluster is balanced. It increases when you add or remove a node. |
| PD | Store Status – Up Stores | the number of TiKV nodes that are up | |
| PD | Store Status – Disconnect Stores | the number of TiKV nodes that encounter abnormal communication within a short time | |
| PD | Store Status – LowSpace Stores | the number of TiKV nodes with an available space of less than 80% | |
| PD | Store Status – Down Stores | the number of TiKV nodes that are down | The normal value is 0. A value greater than 0 means that one or more nodes are abnormal. |
| PD | Store Status – Offline Stores | the number of TiKV nodes that are being taken offline but are still providing service | |
| PD | Store Status – Tombstone Stores | the number of TiKV nodes that have been taken offline successfully | |
| PD | 99% completed_cmds_duration_seconds | the 99th percentile duration to complete a pd-server request | less than 5 ms |
| PD | handle_requests_duration_seconds | the duration of a PD request | |
| TiKV | leader | the number of leaders on each TiKV node | |
| TiKV | region | the number of Regions on each TiKV node | |
| TiKV | CPU | the CPU usage ratio on each TiKV node | |
| TiKV | Memory | the memory usage on each TiKV node | |
| TiKV | store size | the amount of data on each TiKV node | |
| TiKV | cf size | the amount of data in each CF in the cluster | |
| TiKV | channel full | No data points are displayed in normal conditions. If a value is displayed, the corresponding TiKV node is failing to handle the messages. | |
| TiKV | server report failures | No data points are displayed in normal conditions. If Unreachable is displayed, TiKV is encountering a communication issue. | |
| TiKV | scheduler pending commands | the number of commits waiting in the queue | Occasional value peaks are normal. |
| TiKV | coprocessor pending requests | the number of requests waiting in the queue | 0 or very small |
| TiKV | coprocessor executor count | the number of various query operations | |
| TiKV | coprocessor request duration | the time consumed by TiKV queries | |
| TiKV | raft store CPU | the CPU usage ratio of the raftstore thread | Currently, raftstore is a single thread. A value of over 80% indicates that the CPU usage ratio is very high. |
| TiKV | Coprocessor CPU | the CPU usage ratio of the TiKV query thread, which depends on the application; complex queries consume a great deal of CPU | |
| System Info | Vcores | the number of CPU cores | |
| System Info | Memory | the total memory | |
| System Info | CPU Usage | the CPU usage ratio, 100% at a maximum | |
| System Info | Load [1m] | the system load over the last 1 minute | |
| System Info | Memory Available | the size of the available memory | |
| System Info | Network Traffic | the statistics of the network traffic | |
| System Info | TCP Retrans | the statistics of TCP retransmissions | |
| System Info | IO Util | the disk usage ratio, 100% at a maximum; generally, consider adding a new node when the usage ratio reaches 80% to 90% | |
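
As an illustration of how the normal ranges in the table might be applied, the following Python sketch compares a set of readings against those ranges and returns a warning for each value outside its range. It is only a sketch under stated assumptions: the dictionary keys (such as `down_stores` and `raftstore_cpu_percent`) are hypothetical names chosen for this example, not Grafana panel or Prometheus metric identifiers, and the readings must be copied from the dashboard or collected by your own tooling.

```python
from typing import Callable, Dict, List, Tuple

# Each rule: (hypothetical key, test that returns True when abnormal, message).
# The thresholds come from the "Normal Range" column of the table above;
# the key names are assumptions made for this sketch.
RULES: List[Tuple[str, Callable[[float], bool], str]] = [
    ("leader_balance_ratio", lambda v: v >= 0.05, "Leader Balance Ratio is 5% or more"),
    ("region_balance_ratio", lambda v: v >= 0.05, "Region Balance Ratio is 5% or more"),
    ("down_stores", lambda v: v > 0, "Down Stores is greater than 0"),
    ("pd_p99_cmd_duration_ms", lambda v: v >= 5, "99% PD command duration is 5 ms or more"),
    ("raftstore_cpu_percent", lambda v: v > 80, "raft store CPU is over 80%"),
    ("io_util_percent", lambda v: v >= 80, "IO Util has reached 80% or more"),
]


def check_overview(values: Dict[str, float]) -> List[str]:
    """Return one warning per supplied reading that is outside its normal range.

    Readings that are not supplied are skipped rather than treated as errors.
    """
    warnings = []
    for key, is_abnormal, message in RULES:
        if key in values and is_abnormal(values[key]):
            warnings.append(f"{message} (value: {values[key]})")
    return warnings


# Example readings; these numbers are made up for illustration.
sample = {
    "leader_balance_ratio": 0.02,
    "down_stores": 1,
    "raftstore_cpu_percent": 85.0,
}
for warning in check_overview(sample):
    print(warning)
```

A warning from this check is a prompt to look closer, not a diagnosis. As the table notes, the balance ratios rise temporarily when a node is restarted, added, or removed.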