Key Metrics

Learn about the key metrics displayed on the Grafana Overview dashboard.

If your TiKV cluster is deployed using Ansible or Docker Compose, the monitoring system is deployed at the same time. For more details, see Overview of the TiKV Monitoring Framework.

The Grafana dashboard is divided into a series of sub-dashboards which include Overview, PD, TiKV, and so on. You can use various metrics to help you diagnose the cluster.

You can also deploy your own Grafana server to monitor the TiKV cluster, especially if you use TiKV without TiDB. This document describes the key metrics so that you can monitor the Prometheus metrics you are interested in.
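
As a rough illustration of reading such a metric outside Grafana, the sketch below builds and runs an instant query against the Prometheus HTTP API (`/api/v1/query`). The server address `http://127.0.0.1:9090` and the helper names `build_query_url` and `instant_query` are assumptions made for this example; they are not part of TiKV.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run an instant query and return the list of result series.

    Assumes a Prometheus server is reachable at base_url (hypothetical address).
    """
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Only prints the URL; call instant_query() when a server is available.
    print(build_query_url("http://127.0.0.1:9090",
                          'tikv_store_size_bytes{type="available"}'))
```

With a Prometheus server that scrapes the cluster, calling `instant_query("http://127.0.0.1:9090", 'tikv_store_size_bytes{type="available"}')` should return the matching series together with their label sets.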

Key metrics description

To understand the key metrics, check the following table:

| Service | Metric Name | Description | Normal Range |
|---|---|---|---|
| Cluster | tikv_store_size_bytes | The size of storage. The metric has a type label (e.g., “capacity”, “available”). | |
| gRPC | tikv_grpc_msg_duration_seconds | Bucketed histogram of gRPC server messages. The metric has a type label which represents the type of the server message. You can count the metric and calculate the QPS. | |
| gRPC | tikv_grpc_msg_fail_total | The total number of gRPC message handling failures. The metric has a type label which represents the gRPC message type. | |
| gRPC | grpc batch size of gRPC requests | gRPC batch size of gRPC requests. | |
| Scheduler | tikv_scheduler_too_busy_total | The total count of too busy schedulers. The metric has a type label which represents the scheduler type. | |
| Scheduler | tikv_scheduler_contex_total | The total number of pending commands. The scheduler receives commands from clients and executes them against the MVCC layer storage engine. | |
| Scheduler | tikv_scheduler_stage_total | Total number of commands on each stage. The metric has two labels: type and stage. stage represents the stage of executed commands, such as “read_finish”, “async_snapshot_err”, and “snapshot”. | |
| Scheduler | tikv_scheduler_commands_pri_total | Total count of commands at each priority. The metric has a priority label. | |
| Server | tikv_server_grpc_resp_batch_size | gRPC batch size of gRPC responses. | |
| Server | tikv_server_report_failure_msg_total | Total number of reporting failure messages. The metric has two labels: type and store_id. type represents the failure type, and store_id represents the destination peer store ID. | |
| Server | tikv_server_raft_message_flush_total | Total number of Raft messages flushed immediately. | |
| Server | tikv_server_raft_message_recv_total | Total number of Raft messages received. | |
| Server | tikv_region_written_keys | Histogram of written keys for regions. | |
| Server | tikv_server_send_snapshot_duration_seconds | Bucketed histogram of the duration in which the server sends snapshots. | |
| Server | tikv_region_written_bytes | Histogram of bytes written for regions. | |
| Raft | tikv_raftstore_leader_missing | Total number of regions whose leader is missing. | |
| Raft | tikv_raftstore_region_count | The number of regions collected in each TiKV node. The type label takes the values region and leader. region represents the regions collected, and leader represents the number of leaders in each TiKV node. | |
| Raft | tikv_raftstore_region_size | Bucketed histogram of approximate region size. | |
| Raft | tikv_raftstore_apply_log_duration_seconds | Bucketed histogram of the duration in which each peer applies logs. | |
| Raft | tikv_raftstore_commit_log_duration_seconds | Bucketed histogram of the duration in which each peer commits logs. | |
| Raft | tikv_raftstore_raft_ready_handled_total | Total number of Raft ready handled. The metric has a type label. | |
| Raft | tikv_raftstore_raft_process_duration_secs | Bucketed histogram of the duration in which each peer processes Raft. The metric has a type label. | |
| Raft | tikv_raftstore_event_duration | Duration of raftstore events. The metric has a type label. | |
| Raft | tikv_raftstore_raft_sent_message_total | Total number of messages sent by Raft ready. The metric has a type label. | |
| Raft | tikv_raftstore_raft_dropped_message_total | Total number of messages dropped by Raft. The metric has a type label. | |
| Raft | tikv_raftstore_apply_proposal | The count of proposals sent by a region at once. | |
| Raft | tikv_raftstore_proposal_total | Total number of proposals made. The metric has a type label. | |
| Raft | tikv_raftstore_request_wait_time_duration_secs | Bucketed histogram of request wait time duration. | |
| Raft | tikv_raftstore_propose_log_size | Bucketed histogram of the size of each peer proposing log. | |
| Raft | tikv_raftstore_apply_wait_time_duration_secs | Bucketed histogram of apply task wait time duration. | |
| Raft | tikv_raftstore_admin_cmd_total | Total number of admin commands processed. The metric has two labels: type and status. | |
| Raft | tikv_raftstore_check_split_total | Total number of raftstore split checks. The metric has a type label. | |
| Raft | tikv_raftstore_check_split_duration_seconds | Bucketed histogram of duration for the raftstore split check. | |
| Raft | tikv_raftstore_local_read_reject_total | Total number of rejections from the local reader. The metric has a reason label which represents the rejection reason. | |
| Raft | tikv_raftstore_snapshot_duration_seconds | Bucketed histogram of raftstore snapshot process duration. The metric has a type label. | |
| Raft | tikv_raftstore_snapshot_traffic_total | The total amount of raftstore snapshot traffic. The metric has a type label. | |
| Raft | tikv_raftstore_local_read_executed_requests | Total number of requests directly executed by the local reader. | |
| Coprocessor | tikv_coprocessor_request_duration_seconds | Bucketed histogram of coprocessor request duration. The metric has a req label. | |
| Coprocessor | tikv_coprocessor_request_error | Total number of push down request errors. The metric has a reason label. | |
| Coprocessor | tikv_coprocessor_scan_keys | Bucketed histogram of scan keys observed per request. The metric has a req label which represents the tag of requests. | |
| Coprocessor | tikv_coprocessor_rocksdb_perf | Total number of RocksDB internal operations from PerfContext. The metric has two labels: req and metric. req represents the tag of requests, and metric is the performance metric, such as “block_cache_hit_count”, “block_read_count”, and “encrypt_data_nanos”. | |
| Coprocessor | tikv_coprocessor_executor_count | The number of various query operations. The metric has a single type label which represents the related query operation (e.g., “limit”, “top_n”, and “batch_table_scan”). | |
| Coprocessor | tikv_coprocessor_response_bytes | Total bytes of response body. | |
| Storage | tikv_storage_mvcc_versions | Histogram of versions for each key. | |
| Storage | tikv_storage_mvcc_gc_delete_versions | Histogram of versions deleted by GC for each key. | |
| Storage | tikv_storage_mvcc_conflict_counter | Total number of conflict errors. The metric has a type label. | |
| Storage | tikv_storage_mvcc_duplicate_cmd_counter | Total number of duplicated commands. The metric has a type label. | |
| Storage | tikv_storage_mvcc_check_txn_status | Counter of different results of check_txn_status. The metric has a type label. | |
| Storage | tikv_storage_command_total | Total number of commands received. The metric has a type label. | |
| Storage | tikv_storage_engine_async_request_duration_seconds | Bucketed histogram of processing successful asynchronous requests. The metric has a type label. | |
| Storage | tikv_storage_engine_async_request_total | Total number of engine asynchronous requests. The metric has two labels: type and status. | |
| GC | tikv_gcworker_gc_task_fail_vec | Counter of failed GC tasks. The metric has a task label. | |
| GC | tikv_gcworker_gc_task_duration_vec | Duration of GC task execution. The metric has a task label. | |
| GC | tikv_gcworker_gc_keys | Counter of keys affected during GC. The metric has two labels: cf and tag. | |
| GC | tikv_gcworker_autogc_processed_regions | Regions processed by auto GC. The metric has a type label. | |
| GC | tikv_gcworker_autogc_safe_point | Safe point used for auto GC. The metric has a type label. | |
| Snapshot | tikv_snapshot_size | Size of snapshot. | |
| Snapshot | tikv_snapshot_kv_count | Total number of KVs in the snapshot. | |
| Snapshot | tikv_worker_handled_task_total | Total number of tasks handled by the worker. The metric has a name label. | |
| Snapshot | tikv_worker_pending_task_total | The number of tasks that the worker is currently running or that are pending. The metric has a name label. | |
| Snapshot | tikv_futurepool_handled_task_total | The total number of tasks handled by future_pool. The metric has a name label. | |
| Snapshot | tikv_snapshot_ingest_sst_duration_seconds | Bucketed histogram of RocksDB ingestion durations. | |
| Snapshot | tikv_futurepool_pending_task_total | The current number of pending and running future_pool tasks. The metric has a name label. | |
| RocksDB | tikv_engine_get_served | Get queries served by the engine. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_write_stall | Histogram of write stall. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_size_bytes | Size of each column family. The metric has two labels: db and type. db represents which database is being counted (e.g., “kv”, “raft”), and type represents the type of column family (e.g., “default”, “lock”, “raft”, “write”). | |
| RocksDB | tikv_engine_flow_bytes | Bytes and keys of read/write. The metric has a type label (e.g., “capacity”, “available”). | |
| RocksDB | tikv_engine_wal_file_synced | The number of times WAL sync is done. The metric has a db label. | |
| RocksDB | tikv_engine_get_micro_seconds | Histogram of get duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_locate | The number of calls to seek/next/prev. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_seek_micro_seconds | Histogram of seek duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_write_served | Write queries served by the engine. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_write_micro_seconds | Histogram of write duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_write_wal_time_micro_seconds | Histogram of WAL write duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_event_total | Number of engine events. The metric has three labels: db, cf, and type. | |
| RocksDB | tikv_engine_wal_file_sync_micro_seconds | Histogram of WAL file sync duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_sst_read_micros | Histogram of SST read duration in microseconds. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_compaction_time | Histogram of compaction time. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_block_cache_size_bytes | Block cache usage of each column family. The metric has two labels: db and cf. | |
| RocksDB | tikv_engine_compaction_reason | The number of compaction reasons. The metric has three labels: db, cf, and reason. | |
| RocksDB | tikv_engine_cache_efficiency | Efficiency of RocksDB’s block cache. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_memtable_efficiency | Hits and misses of the memtable. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_bloom_efficiency | Efficiency of RocksDB’s bloom filter. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_estimate_num_keys | Estimated number of keys in each column family. The metric has two labels: db and cf. | |
| RocksDB | tikv_engine_compaction_flow_bytes | Bytes of read/write during compaction. | |
| RocksDB | tikv_engine_bytes_per_read | Histogram of bytes per read. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_read_amp_flow_bytes | Bytes of read amplification. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_bytes_per_write | Histogram of bytes per write. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_num_snapshots | Number of unreleased snapshots. The metric has a db label. | |
| RocksDB | tikv_engine_pending_compaction_bytes | Pending compaction bytes. The metric has two labels: db and cf. | |
| RocksDB | tikv_engine_num_files_at_level | Number of files at each level. The metric has three labels: db, cf, and level. | |
| RocksDB | tikv_engine_compression_ratio | Compression ratio at different levels. The metric has three labels: db, cf, and level. | |
| RocksDB | tikv_engine_oldest_snapshot_duration | Oldest unreleased snapshot duration in seconds. The metric has a db label. | |
| RocksDB | tikv_engine_write_stall_reason | QPS of each reason which causes TiKV write stall. The metric has two labels: db and type. | |
| RocksDB | tikv_engine_memory_bytes | Size of each column family. The metric has three labels: db, cf, and type. | |
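
Many rows in the table are counters or bucketed histograms, and the gRPC row notes that you can count a metric to calculate the QPS. The sketch below shows, in simplified form, the arithmetic involved: a per-second rate from two counter samples, and a quantile estimate from cumulative histogram buckets (the `le` upper bounds that Prometheus exposes). It only approximates what the PromQL functions `rate()` and `histogram_quantile()` do (there is no counter-reset handling or extrapolation), the function names are hypothetical, and all sample values are made up for illustration.

```python
from typing import List, Tuple


def counter_rate(first: float, last: float, elapsed_seconds: float) -> float:
    """Average per-second increase of a counter between two samples.

    Simplified: assumes the counter did not reset between the samples.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (last - first) / elapsed_seconds


def bucket_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with float("inf").
    Values are assumed to be spread evenly inside each bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # The quantile falls past the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Made-up samples: a _count series read 60 seconds apart.
    print(counter_rate(1200, 1800, 60))
    # Made-up cumulative buckets for a duration histogram, in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (float("inf"), 100)]
    print(bucket_quantile(0.95, sample))
```

In practice you would normally let Prometheus do this work, for example with an expression such as `sum(rate(tikv_grpc_msg_duration_seconds_count[1m])) by (type)` for per-type gRPC QPS; the Python version is only meant to make the calculation visible.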