Monitoring a Cluster
Learn how to monitor the state of a TiKV cluster.
Currently, you can use two types of interfaces to monitor the state of the TiKV cluster:
- The component state interface: uses an HTTP interface to get the internal information of a component.
- The metrics interface: uses the Prometheus interface to record the detailed information of various operations in the components.
The component state interface
You can use this type of interface to monitor the basic information of components. It can provide the details of the entire TiKV cluster and can serve as the interface for Keepalive monitoring.
The PD server
The API address of the Placement Driver (PD) is http://${host}:${port}/pd/api/v1/${api_name}
The default port number is 2379.
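For example, assuming PD listens on the default address 127.0.0.1:2379, substituting a couple of commonly used API names into the pattern above looks like this (the address is an assumption based on the defaults):

```bash
# List the members of the PD cluster
curl http://127.0.0.1:2379/pd/api/v1/members

# Show the current PD scheduling and replication configuration
curl http://127.0.0.1:2379/pd/api/v1/config
```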
For detailed information about various API names, see PD API doc.
You can use the interface to get the state of all the TiKV instances and the information about load balancing. It is the most important and frequently-used interface to get the state information of all the TiKV nodes. See the following example for the information about a 3-instance TiKV cluster deployed on a single machine:
```bash
curl http://127.0.0.1:2379/pd/api/v1/stores
```

```json
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "127.0.0.1:20161",
        "version": "2.1.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "937 GiB",
        "available": "837 GiB",
        "leader_weight": 1,
        "region_count": 1,
        "region_weight": 1,
        "region_score": 1,
        "region_size": 1,
        "start_ts": "2018-09-29T00:05:47Z",
        "last_heartbeat_ts": "2018-09-29T00:23:46.227350716Z",
        "uptime": "17m59.227350716s"
      }
    },
    {
      "store": {
        "id": 2,
        "address": "127.0.0.1:20162",
        "version": "2.1.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "937 GiB",
        "available": "837 GiB",
        "leader_weight": 1,
        "region_count": 1,
        "region_weight": 1,
        "region_score": 1,
        "region_size": 1,
        "start_ts": "2018-09-29T00:05:47Z",
        "last_heartbeat_ts": "2018-09-29T00:23:45.65292648Z",
        "uptime": "17m58.65292648s"
      }
    },
    {
      "store": {
        "id": 7,
        "address": "127.0.0.1:20160",
        "version": "2.1.0-rc.2",
        "state_name": "Up"
      },
      "status": {
        "capacity": "937 GiB",
        "available": "837 GiB",
        "leader_count": 1,
        "leader_weight": 1,
        "leader_score": 1,
        "leader_size": 1,
        "region_count": 1,
        "region_weight": 1,
        "region_score": 1,
        "region_size": 1,
        "start_ts": "2018-09-29T00:05:47Z",
        "last_heartbeat_ts": "2018-09-29T00:23:44.853636067Z",
        "uptime": "17m57.853636067s"
      }
    }
  ]
}
```
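As a quick sketch of the Keepalive-style check mentioned above, the following command extracts each store's address and state from the stores output. It assumes jq is installed and that PD is reachable at 127.0.0.1:2379:

```bash
# Print each store's address and state; every store should report "Up"
curl -s http://127.0.0.1:2379/pd/api/v1/stores \
  | jq -r '.stores[] | "\(.store.address) \(.store.state_name)"'
```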
The metrics interface
You can use this type of interface to monitor the state and performance of the entire cluster. The metrics data is displayed in Prometheus and Grafana. See Use Prometheus and Grafana for how to set up the monitoring system.
You can get the following metrics for each component:
The PD server
- the total number of times that commands are executed
- the total number of times that a certain command fails
- the duration of successful command executions
- the duration of failed command executions
- the duration for a command to finish and return a result
The TiKV server
- Garbage Collection (GC) monitoring
- the total number of times that TiKV commands are executed
- the duration of command execution by the Scheduler
- the total number of Raft propose commands
- the duration of command execution by Raft
- the total number of times that Raft commands fail
- the total number of times that Raft processes the ready state
Use Prometheus and Grafana
This section introduces the deployment architecture of Prometheus and Grafana in TiKV, and how to set up and configure the monitoring system.
The deployment architecture
See the following diagram for the deployment architecture:
Set up the monitoring system
You can use your existing Prometheus server to pull metrics from TiKV, or quickly bootstrap a test server using the script below. Pair it with Grafana to visualize the metrics.
Use the following script to bootstrap Grafana and Prometheus on a local CentOS system:
```bash
cat <<EOT > /etc/yum.repos.d/prometheus.repo
[prometheus]
name=prometheus
baseurl=https://packagecloud.io/prometheus-rpm/release/el/\$releasever/\$basearch
repo_gpgcheck=1
enabled=1
gpgkey=https://packagecloud.io/prometheus-rpm/release/gpgkey
       https://raw.githubusercontent.com/lest/prometheus-rpm/master/RPM-GPG-KEY-prometheus-rpm
gpgcheck=1
metadata_expire=300
EOT

cat <<EOT > /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOT

yum install -y prometheus grafana
systemctl enable --now prometheus grafana-server
```
Browse to port 3000 on the host and you should be able to log in with `admin`/`admin`.
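If you want to confirm that both services came up before moving on, the checks below are a reasonable sketch. They assume the default ports: Prometheus exposes a readiness probe at /-/ready on port 9090, and Grafana exposes a health endpoint at /api/health on port 3000:

```bash
# Confirm both systemd units are active
systemctl status prometheus grafana-server --no-pager

# Prometheus readiness probe (default port 9090)
curl -s http://127.0.0.1:9090/-/ready

# Grafana health check (default port 3000)
curl -s http://127.0.0.1:3000/api/health
```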
Configure Prometheus
To configure Prometheus, edit `/etc/prometheus/prometheus.yml` and update the `scrape_configs` block:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'test-cluster'
    monitor: "prometheus"

scrape_configs:
  - job_name: 'pd'
    honor_labels: true
    static_configs:
      - targets:
        - '127.0.0.1:2379'

  - job_name: 'tikv'
    honor_labels: true
    static_configs:
      - targets:
        - '127.0.0.1:20160'
```
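If you run more than one TiKV instance, such as the three-instance example shown earlier, list each instance as a scrape target. The sketch below reuses the store addresses from the stores output above and assumes each instance exposes metrics at its listed address, just as the single-target example does:

```yaml
  - job_name: 'tikv'
    honor_labels: true
    static_configs:
      - targets:
        - '127.0.0.1:20160'
        - '127.0.0.1:20161'
        - '127.0.0.1:20162'
```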
Restart Prometheus with `systemctl restart prometheus`.
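To verify that Prometheus picked up both jobs, you can validate the configuration and list the active scrape targets. This is a sketch that assumes promtool is installed alongside the Prometheus package and that Prometheus runs on the default port 9090:

```bash
# Validate the configuration file before (or after) restarting
promtool check config /etc/prometheus/prometheus.yml

# List the active scrape targets and their health
curl -s http://127.0.0.1:9090/api/v1/targets
```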
Configure Grafana
To add your Prometheus server as a data source in Grafana:
- Log in to the Grafana web interface (default login: `admin`/`admin`).
- Visit Add a data source.
- Specify the data source information:
  - Specify a name for the data source.
  - For Type, select Prometheus.
  - For Url, specify the Prometheus address (for example, `http://127.0.0.1:9090`).
  - Specify other fields as needed.
- Click Save & Test to save the new data source.
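If you prefer to script this step instead of clicking through the web interface, Grafana's HTTP API can create the same data source. The sketch below assumes the default `admin`/`admin` credentials, the local addresses used throughout this page, and a hypothetical data source name of `tikv-prometheus`:

```bash
# Create a Prometheus data source via the Grafana HTTP API
curl -s -X POST http://admin:admin@127.0.0.1:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "tikv-prometheus",
        "type": "prometheus",
        "url": "http://127.0.0.1:9090",
        "access": "proxy"
      }'
```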
Grafana Dashboard Templates
The TiKV community has developed several useful dashboards to get you started with Grafana. You may need to adjust some settings, particularly if you configured the `external_labels` option in the `prometheus.yml` configuration above.
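As one way to load such a dashboard without using the UI, the sketch below posts a dashboard definition to Grafana's dashboard API. The file name `tikv_summary.json` is hypothetical; it stands in for a dashboard JSON file you have already downloaded, and the credentials and address are the defaults assumed above:

```bash
# Wrap the downloaded dashboard JSON in the payload the API expects and import it
curl -s -X POST http://admin:admin@127.0.0.1:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d "{\"dashboard\": $(cat tikv_summary.json), \"overwrite\": true}"
```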